Preface


THE purposes stated for this book in the original edition have also guided
its revision. The basic course in testing should present the principles of
testing in such a way that the student will learn to choose tests wisely for
particular needs, and will be aware of the potentialities and limitations of
the tests he chooses. We now have a large number of general principles of
testing to aid in such evaluation and interpretation.

Psychological testing has been advanced chiefly by two lines of work:
one, the practical and clinical application of instruments; the other, the
theoretical and mathematical analysis of testing problems. These two lines
of thought have often remained independent, so that test interpretations
employed by clinicians and counselors frequently appear untrustworthy
when judged by psychometric standards. Conversely, the clinician often
finds the precisely designed and narrowly focused tests that come from the
psychometric specialist unsatisfactory because they do not serve his practical
needs. The cleavage between these two schools of thought has been reduced
during the past decade, on the one hand by the increased concern of
clinicians for the rigorous specification of hypotheses and validation, on the
other hand by the broadening of psychometric theory to make a place for
tests designed for purposes other than prediction of a specific criterion. This
book views tests from both the practical and technical perspectives, so that
the industrial, clinical, educational, or military psychologist will learn how
the psychometric specialist evaluates tests and the psychometric specialist
will understand the practical requirements which tests must meet.

The book is intended to serve the needs of undergraduates and beginning
graduate students in psychology and counseling. It makes no attempt to
exhaust any one of the fields of testing; rather, it covers those essentials on
which later study of such specialties as industrial selection, clinical case
interpretation, or test theory may be based.

There has been substantial change from the first edition of the book,
though the broad outline and aims remain the same and most of the basic
principles stand unchanged. The past decade has seen notable advances in
testing and test theory, including the Technical Recommendations for
psychological and achievement tests and the associated reformulation of
concepts of validity, the extensive validation of differential aptitude batteries,
and the decline of diagnostic pattern interpretation of the Wechsler
intelligence scale from a widely accepted practice to a discredited hypothesis.
One of the most striking changes in this period has been the improved
quality of the information supplied by test publishers. For many tests, flimsy
and inadequate manuals have been replaced by technical handbooks of
monograph length, thereby increasing the importance of skill in interpreting
information about reliability, validity, and norms.

In my teaching, I place particular emphasis upon this skill, the principal
assignments being concerned with reviewing of tests and selecting tests for
particular programs (e.g., guidance of freshmen in a described liberal arts
college). The presentation of specific tests in this book is designed to assist
in this function and not to substitute for it. Tests selected for extended
description have wide application, illustrate important techniques and types of
evidence, or illustrate significant principles. The space given to a test is by
no means an indication of its merit; perhaps the prime determiner of
inclusion has been the amount and variety of relevant information available,
which biases the selection toward older tests. In order to introduce the
student to a wider range of tests, a summary listing is given in many
chapters. This summary is primarily a set of suggestions for further study. The
annotation is too brief to serve as a critical review, and perhaps carries
favorable or unfavorable connotations which I did not intend. I have
accepted this risk in order to provide a preliminary guide to the beginner lost
in the morass of test titles. It is urged that the reader use the summaries only
to prepare a list of tests to be studied further, bearing in mind that even
this summary covers only a fraction of the tests on the market. A decision
about the merit of a test must come after a study of the test manual and
accompanying information, the Buros yearbook reviews, and other sources.

A word about the restriction of the listings to tests used in the United States
is in order. So far as the principles which constitute the main content of
the book are concerned, there is no such restriction. The principles of testing
are universal, as can be seen by comparing this book to such sources in the
bibliography as Meili, Vernon, Laugier, and Piéron, and by the international
status of such tests as the Binet, the GATB, the MMPI, and the TAT. The
differences between the psychometric and impressionistic testers within each
country are far greater than any national differences. On the other hand, the
summary of specific instruments is confined entirely to tests and editions on
the American market. This neglect of tests developed in other countries is
made necessary both by the facts of test distribution abroad, which limit the
use of any one test, and by the inability of a reviewer to comment fairly on
tests with which his acquaintance is remote.

The questions which stud the book are part of the text, capitalizing on
the fact that the mind profits most when it works as it reads. By thinking
through the questions on each section, the reader sees how the principles
apply and becomes aware of topics which require further thought. The
questions do not always have specific answers. Frequently they are
deliberately controversial, or can be answered only by a qualified “Yes, but—.”
The student who sees two sides to any of the questions can have considerable
confidence that he is doing good thinking.

In accomplishing my purposes, I have been greatly aided by my professional
associations of the past ten years. Particularly broadening were the
intimate association over a five-year period with the Committees on Test
Standards of APA and other associations, the opportunity to pursue research
in test theory made possible by the Bureau of Educational Research of the
University of Illinois and the Office of Naval Research, and the opportunity
given me by the Office of Naval Research and the National Institutes of
Health to become acquainted with test research and applications at home
and abroad. My colleagues in these ventures taught me much about tests.
Howard B. Lyman, Russell P. Kropp, and Andrew Baggaley gave suggestions
for the revision, and Jean W. Macfarlane’s criticisms of this manuscript
led to many improvements. I wish also to give special thanks to the
representatives of various test publishing houses, all of whom have been most
cooperative in supplying information about their tests and in helping me to
clarify my ideas. As always, the students on whom these ideas have been
tried were a major source of motivation and insight. Mrs. Lester M. Eneml's
services as typist of many drafts of the manuscript are acknowledged with
appreciation.

LEE J. CRONBACH

September, 1959 
Who Uses Tests?


THE testing movement stands as a prime example of social science in
action, since it touches on vital questions in all phases of our life. What is
character and what sorts of children have good character? What personality
make-up promises that an adolescent will be a stable, effective adult? How
can we tell which 6-year-olds are ready to begin learning to read? Is this
young man a good prospect for training in watchmaking, or should he go
into a different vocation—say steamfitting or patternmaking? Such are the
problems toward which testing and research on individual differences are
directed. In this book, we will survey the methods which have been and are
being developed to solve these problems.

TYPICAL TEST USERS 

One way to get a quick overview of the region we are to explore is to find
out what testers do. By meeting a few of the people who work with tests we
can get an impression of the variety of services tests perform and of the way
they fit into a psychological career. The people to be described are
imaginary, each one being a composite portrait of many psychologists such as can
be found in every part of the country.

Let’s begin by calling on Helen Kimball. At about eleven on a January
morning, we find her at her desk in the central administration building of
the school system of Riverton, population 17,000. Miss Kimball is dark,
attractive, 35ish. Her position bears the title School Psychologist. The office in
which we find her is unusually bright, with decorative pictures, drapes, and
a table low enough to accommodate a child. On the table are spread several
objects: blocks, a cutout puzzle, a folder of pictures.

Miss Kimball apologizes for the disorder of the table as she greets us. “I
just finished testing a boy and haven’t had time to clean up the materials.
Usually I keep just a toy or two on the table, to attract the interest of any
child sent down to see me. These test materials are from the Wechsler
intelligence scale and a picture test for studying personality called the
Thematic Apperception Test.” When we express interest in her case, and
inquire about the reason for testing the boy, she outlines his background as
follows.

Charles is a boy from a foreign home, middle-to-low economic status, who
in the fourth grade suddenly is causing trouble after having been known as a
friendly, successful pupil in other grades. His teacher reports that he has
made almost no progress in school subjects since the start of the year, that he
refuses her attempts to give him extra help, and that he has begun to disturb
the class by hitting other boys, taking objects from the girls to annoy them,
and similar misdemeanors. A check with the files showed that his previous
teachers had made many favorable reports. “A fine worker. Does everything
a little better than most other boys.” “Learns new ideas quickly. Good at
number work.” But the objective tests given at the end of the third grade
showed that he was not superior. In fact, in reading comprehension Charles
was two months behind the average pupil of his class, and in arithmetic, his
best score, he just reached the average. Probably the teachers were misled
by his cheerfulness and industry into overrating his past learning.

“Now,” says Miss Kimball, “they asked me to try to determine the causes
of his problem. Teachers in each school check most of the cases; for instance,
they give intelligence tests and reading tests, and make studies of the
children the school needs to know more about. Charles was sent to me
because the teacher felt his behavior presented an especially serious problem.
The school did have a mental-test record, because Charles’ class took the
Kuhlmann-Anderson group intelligence test two months ago. Charles’ IQ
was only 65. But his teacher said Charles wouldn’t work on the test. He did
a few items, then stopped and looked out the window; when she urged him
to go ahead, he worked slowly, and seemed not to be trying.

“So my first problem was to try to find out how bright Charles is, to learn
what to expect of him. The Wechsler or the Stanford-Binet is our usual
measure. Since we give these tests individually, most children cooperate well.
When I gave the Wechsler this morning, Charles did about as well as most
10-year-olds; I haven’t computed his IQ yet, but from the impression I
formed as I gave the test, it will come out about 90 to 100—just a trifle below
average. The score might be affected by his schooling, as many of the
questions use language. The Performance section of the test, though, uses blocks,
picture puzzles, and other tasks not likely to be affected by schooling, and
he did about the same as on the Verbal section; apparently language
difficulties aren’t his big problem. I was pleased that he cooperated, since he’d
had trouble before. He was eager to work, cheerful, and seemed pleased with
his accomplishment. But of course we started out slowly, and I made a great
effort to interest him in the ‘games.’
“I did two other things with Charles. Usually we don’t test so much in one
day, but the school wants to make some decisions about Charles at midyear.
So we broke off the testing and chatted awhile, then I gave him a vision test.
I chose that because I noticed some squinting during the intelligence test,
and the teacher had noted a few complaints of headaches. My vision tests
aren’t as precise as an oculist’s, but they showed a little deficiency in one eye.
Worse, though, is his coordination; the eyes don’t work together, but instead
look at slightly different parts of the page when he is reading. This probably
can be corrected, but we’ll need further visual tests to be sure. Poor visual
coordination would cause trouble in reading and lead to fatigue.

“Since the emotional problem seemed to be severe, judging from the
reports of Charles’ social behavior, I used my picture test. The child makes up
stories about each picture, and the stories often reveal his worries and wishes.
Here’s one picture, for example, showing a boy huddled up in a corner.
Charles made up a story that the boy was playing with the others and they
made him stop and go home. The other boys said he had a different way of
playing that wasn’t right. Several stories like that suggest that Charles is
greatly worried about losing his friends, and about ‘being different.’ The test
gives many other suggestions about Charles’ problems, but I need to study
the record before I form definite conclusions.

“Our next steps will be to check on the vision problem and to clarify the
emotional difficulties. I’ll have several conferences with Charles, helping
him talk out his difficulties. Then we will see what can be done to help him
solve them. The fact that he has normal mental ability is encouraging, since
we know he can do well if his adjustment improves. It will help to know that
he is average rather than superior, as past teachers suggested. Perhaps he’s
had to live up to too high a reputation. We may use further tests later; the
ones used so far have narrowed our field of investigation, so that my
conferences with Charles will be effective.”

This sample gives some idea of Miss Kimball’s work. No two cases are just
alike, nor are the same tests appropriate for every case. In contrast to this
“clinical” approach is the work of a personnel manager for a department
store. This is a store with about 350 employees, ranging from roustabouts to
buyers and office personnel. Edward Blake, the personnel manager, is a
heavy-set, graying man of 45, who seems interested in whatever we have to
say. But there is also a briskness, a sticking-to-a-schedule. “The routines of
the job? I don’t do much testing myself, but I do interview everybody we
hire. That helps the store, because every employee knows there’s someone
here in the office who has met him and to whom he can take his problems.

“When an applicant comes in, he fills out a personal-history blank, and my
assistant, Miss Field, gives him a set of tests. The tests aren’t quite the same
for everybody. We give all applicants a short multiple-choice intelligence
test, since different jobs in the store call for employees of different caliber.
Most applicants get a test of simple arithmetic—addition, percentages,
discounts, and so on. For package wrappers and merchandise handlers, we use
a simple test of motor ability in which they place wooden cubes in a box as
rapidly as possible. It doesn’t predict who’ll be the best employee, but it
saves us from some lemons. For a few departments, we have trade tests,
tests of information about the job. Some men claim to be shoe salesmen
when they don’t know a last from a counter. These tests check on the
experience the applicant claims in his application blank.

“Whatever tests Miss Field gives are scored and recorded on the
application blank. Then, when there is a vacancy, we pull out the names of people
who have the qualifications that job requires. I call in one or more of these
people, interview them, and if I think they’ll do, I hire them. The tests are
most useful to sort out the good from the poor prospects. Miss Field can
give the tests very easily, and it saves us a lot of time we’d spend
interviewing people who wouldn’t be good workers. Of course, Miss Field does a nice
job, making sure each person knows we’re interested in him, and sending
each one away with a feeling that he’s had fair consideration.”

Mr. Blake, of course, is a little different from some other personnel
manager we might have talked to. But his work is fairly typical of that in
businesses having substantial turnover.

Unlike Miss Kimball and Mr. Blake, Max Samuels and Paul Sheridan are
using tests for research which will have only distant practical applications.
We find them in the Psychology Laboratory of Atherton University on a July
day, surrounded by piles of test booklets. Samuels gets up from a tape
recorder to which he has been listening and offers to show us around the
project.

“We’re studying how people solve problems. When we give an ordinary
intelligence test, we see that people have many difficulties that seem to have
nothing to do with their brightness. Sometimes a person becomes confused
and makes the same mistake three or four times, even though he has already
done harder problems. Another person may plan a solution to a problem and
carry out several steps in an orderly way, but when he makes one error he
loses his sense of direction and slips back into random trial and error. We are
trying to develop exact methods of measuring these habitual ways of
reacting to difficulties. They are important elements in problem solving, affecting
scores on mental tests and also performance in practical situations.
Intelligence tests, which measure the person’s general level of success, do not give
accurate measures of his manner of problem solving.

“Sheridan and I are just beginning to explore what the important
variables in problem solving may be. We will spend a couple of years refining
our observation techniques before we are ready to carry out formal studies.
Both of us teach during the school year, but we spend about a quarter of
our time giving tests to students. During the summer we analyze the
records, revise the tests for the next tryout, and take a few more steps toward
a theory of problem solving.”

The first test Samuels shows us uses the same blocks Miss Kimball had on
her table. The person tested is shown a mosaic design and asked to make the
same design out of blocks. “In the intelligence test,” Samuels says, “the score
reports the number of designs completed within certain time limits. The
tester may note casual observations of the kinds of error the person makes,
but he does not score them. We are trying to obtain dependable scores
indicating how systematically the person attacks the problem, how often he
repeats a mistake, and how long a time passes before he notices a mistake.
Sheridan gives the test in a room with a large mirror set in the wall. The
mirror is fixed so that one can see through from the back. I sit on that side and
observe every detail of what the subject does. I dictate a record into the tape
recorder. We can listen to the tapes whenever we wish, and work out the
nature and time of each error. We have developed new designs which make
certain types of error more likely, and later we hope to develop a simpler
scoring method which will not require a tape recorder.”

Samuels shows several other tests using mazes, anagrams, and designs
made by building up layers of cutout colored stencils. “Our main purpose,”
he says, “is to identify consistent patterns which the person shows on many
different problems. These patterns are the ones we expect him to carry over
when he writes a theme in English or tries to identify an unknown substance
in chemistry.”

We inquire about a piece of apparatus with a ring of lights and a few
pushbuttons. “This,” he says, “is an experimental test which permits us to
present much longer and more complex tasks than the usual puzzle. It is used
to measure abilities of high-level scientific and technical workers; one needs
very difficult tasks to separate the best men in such a group. We are using
it with average students because they make many errors, and our main
concern is to study the types of error made by different persons. The apparatus
is wired so that it follows some simple rules. These rules change with every
problem. There are three pushbuttons which turn on and off various
combinations of lights. The person’s task may be to turn on light number 3 only.
He presses the buttons in turn to find out what lights each button controls.
For instance, when he presses button 1, lights 3, 4, and 5 go on. When he has
all the information, he must find a sequence of actions which will leave only
light 3 lit. A problem of this type can be made very complicated; even a
bright person takes thirty minutes on some of our problems. One interesting
feature of the apparatus is its automatic recording. When the person
presses a button, a record is punched on a teletype tape. This tape can
be decoded to show just what the person did and when.”
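
The search that the subject faces on this apparatus can be illustrated with
a short modern sketch. It does not describe the laboratory machine itself. It
assumes, purely for illustration, that each button toggles a fixed set of
lights; the wiring table and the function name shortest_sequence are
hypothetical. Only the fact that button 1 controls lights 3, 4, and 5 comes
from the description above. Under those assumptions a breadth-first search
finds a shortest sequence of presses that leaves light 3 alone lit.

```python
from collections import deque


def shortest_sequence(wiring, start, goal):
    """Return a shortest list of button presses turning `start` into `goal`.

    `wiring` maps a button number to the set of lights it toggles. The
    toggling rule is an assumption made for this sketch, not a fact about
    the original apparatus. Returns None if the goal cannot be reached.
    """
    start, goal = frozenset(start), frozenset(goal)
    queue = deque([(start, [])])  # (lights currently lit, presses so far)
    seen = {start}
    while queue:
        lit, presses = queue.popleft()
        if lit == goal:
            return presses
        for button in sorted(wiring):
            # Symmetric difference: lit lights go off, unlit lights go on.
            nxt = lit ^ frozenset(wiring[button])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, presses + [button]))
    return None


# Hypothetical wiring. Only the row for button 1 is taken from the text.
WIRING = {1: {3, 4, 5}, 2: {4}, 3: {5}}

if __name__ == "__main__":
    print(shortest_sequence(WIRING, set(), {3}))  # prints [1, 2, 3]
```

With this assumed wiring the subject must press all three buttons once. A
differently wired problem may need a longer sequence or may have no
solution at all, in which case the function returns None.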

Sheridan comes in at this moment with an armload of boxes that hold
cards for use in computing machines. His role in the project, he
explains, is to analyze the data after all the tests and records have been
scored. “The electronic computer has been a blessing to research like this.
We obtain about 200 scores on every student we test, and I wouldn’t want
ever to work out the relations on an ordinary calculator. But the electronic
machine gives us the answers in just a few hours. The catch is that sometimes
it takes a month to put all the records onto these cards. Every observation
has to be reduced to numerical form before it can be treated statistically.

“Our main statistical method is factor analysis. This helps us to separate
the variables which affect only a single task from the ones which show up
consistently throughout the person’s performance. We also find out which
test scores give the best measures of each variable. The results so far suggest
that we will eventually have dependable measures of how persistent, how
systematic, and how adaptable the person is.

“We are not primarily interested in practical applications of the test. If we
can classify people according to the way they solve problems, then we want
to study how they get that way. Probably anxiety is an important cause of
many of these errors, but we want to learn why one anxious person’s errors
habitually differ from those of another. We will eventually do experiments
in which we frustrate people in various ways to see if different kinds of
emotional stress produce different sorts of errors. But before we can do such
research we have to be able to distinguish and measure these errors.”

It is easy to think of applications for such tests as Sheridan and Samuels
are developing. The tests might be useful in diagnosing mental patients, in
selecting students for specialized training, or in analyzing students whose
school performance is below their ability level. Very often, tests that are
developed for laboratory investigations are put to practical use by applied
psychologists.


Our three examples represent only a few of the many ways in which tests
are used. We might also describe the clinical psychologist in a hospital, the
tester preparing standardized tests for school use, the vocational counselor,
and many others. In addition to these highly qualified investigators, we
might pay more attention to the Miss Fields who give most of the tests in
offices, clinics, schools, and industries. From the portraits presented we
can draw these generalizations which warrant learning about tests:

Tests play an important part in making decisions about people and in
psychological research. There are a great variety of tests, covering many
sorts of characteristics. Even for a single characteristic such as mental ability,
there are many tests which have different uses. The significance of test
scores is greatest when they are combined with a full study of the person by
means of interview, case-history records, application blanks, and other
methods. Tests provide facts which help us understand people; they almost never
are a mechanical tool which can render decisions automatically.

¹ For a technical description of research using a machine of this type, see John, 1957.

PURCHASING TESTS 
Who May Obtain Tests? 

Tests are useful to many professions, but in the hands of persons with
inadequate training they do a great deal of harm. An untrained user may
administer a test incorrectly. He may place undue reliance on inaccurate
measurements. He may misunderstand what the test measures and reach
unsound conclusions. It is therefore important for the user to confine himself to
tests that he can handle properly.

To see the implications of this remark, consider industrial personnel testing
as an example. To a manager it may appear simple to give a group
intelligence test, score it with a punched-out key, tabulate the scores, and hire
the best man. A personnel psychologist, however, knows that on some
routine jobs average men make better employees than highly intelligent men,
who become bored and quit. He knows that a general mental test does not
measure the abilities most important in many factory jobs. He knows that
even experts make errors when they try to guess which tests will predict
success in a given job; a scientifically designed tryout is essential to make
sure that the tests actually pick better employees.

Introducing and operating an industrial testing program requires many
different abilities:

1. Analyzing the job to identify abilities which could be relevant.

2. Selecting promising tests for tryout.

3. Constructing new tests when no published test is suitable.

4. Planning and carrying out an experimental trial; choosing the final
set of tests.

5. Deciding how test results are to be used in selection.

6. Routinely administering tests to applicants.

7. Scoring.

8. Interpreting the test and making hiring decisions within the general
plan.

A great deal of training is required to perform steps 1 through 5. For most
tests used in industry, steps 6 and 7 can be performed by an intelligent
clerical worker under proper supervision. Step 8 may be a routine operation or
may call for a decision by an executive who considers a psychologist’s
recommendation along with other facts.

Industrial personnel workers in the United States are qualified at various
levels:

• Diploma in industrial psychology. A diploma is given by the American
Board of Examiners in Professional Psychology to an industrial psychologist
who possesses (among other qualifications) the training and experience
required for carrying out all phases of a testing program.² A person holding
this diploma is called a diplomate.

• Ph.D. in personnel psychology. A psychologist at this level (who may
have received his training in a university department of psychology,
education, or business management) should be able to perform all the functions
listed above. If he has limited experience, he may need to consult a
better-qualified person, especially in planning the program. Numerous consulting
firms provide assistance of this type.

• Limited specialized training. Workers who have training in personnel
methods equivalent to a master’s degree can carry out specialized functions
within a general plan. They can administer complicated tests, collect data
on the performance of employees, and make some decisions about
individuals. A psychologist can train an intelligent assistant to perform such
functions, although he must then provide close supervision.³

• Intelligent workers without psychological training. A person without
psychological training can learn to administer many group tests, take charge
of the scoring of objective tests, and apply mechanical rules for selection on
the basis of scores.

• Ordinary clerical workers. Workers at this level should be used only
for routine scoring under competent supervision, and for assisting in test
administration.


If we were to consider some other use of tests such as a vocational
counseling service, a school testing program, or a diagnostic service in a mental
hospital, we would observe similar needs. In each of these services there is need
for some routine handling of tests and test data, for responsible supervision,
and for high-level planning of the total program. A testing program involves
far more than buying a package of tests and going to work.

The amount of specialized training required depends upon the tests to be
used. Some tests can be administered and interpreted by responsible
persons who have no specialized training. Other tests serving the same general
purpose can be used only by well-qualified psychologists. For example, two
tests which might have some value in selecting men for training as junior
executives are the Ohio State Psychological Examination and the Thematic
Apperception Test (TAT). The former is a precise and fairly difficult test of
vocabulary knowledge and verbal reasoning ability. The directions and
scoring procedure are so simple that a careful high-school graduate can
follow them. An employer with no psychological training can easily understand
what the results mean. To administer and interpret the TAT, a person must
have graduate training in the psychology of personality and should have
additional supervised experience with this particular test. It is used to
investigate the drives and creative abilities of the applicant, and the conclusions it
suggests are not highly dependable. Serious errors in judgment would
result if the test were interpreted by anyone save a cautious and able
psychologist.

² For further information on diploma . . .

³ . . . to give even fairly complicated tests to other prisoners . . .

The APA Code for Test Distribution. Distributors of tests try to restrict sales
to qualified persons, just as the sale of medicines is restricted. Test
distributors check the qualifications of purchasers to determine whether they are
able to use whatever tests they order. Severe restrictions are placed on the
tests which are most difficult to interpret and the misinterpretation of which
would be most serious.

A further reason for restriction is to prevent copies of questions from
falling into the hands of persons who will later take the test. Students would
like to become familiar with a college entrance examination in advance, but
this knowledge would give them an unfair advantage over other applicants.
Parents sometimes try to help their child by coaching him on intelligence
test items, but to the extent that their coaching succeeds, it prevents the
psychologist from making sound decisions. The control system protects all
legitimate users of published tests.

The guiding principles of the control system are set down in the Ethical
Standards of Psychologists. This important statement was officially adopted
by the American Psychological Association in 1950. The following paragraphs
abstract and paraphrase the formal statement, omitting legalistic details
applying only to borderline problems (Ethical Standards, 1953, pp.
146-148):

Tests and diagnostic aids should be released only to persons who can
demonstrate that they have the knowledge and skill necessary for their
effective use and interpretation. Tests can be classified in the following
categories:

Level A. Tests or aids which can be adequately administered, scored,
and interpreted with the aid of the manual and a general orientation to
the kind of organization in which one is working. (Examples: educational
achievement, trade, and vocational proficiency tests.) Such tests
and aids may be given and interpreted by responsible nonpsychologists
such as school principals and business executives.

Level B. Tests or aids which require some technical knowledge of test
construction and use, and of supporting subjects such as statistics,
individual differences, the psychology of adjustment, personnel psychology,
and guidance. (Examples: general intelligence and special aptitude
tests, interest inventories and personality screening inventories.)

These tests and aids can be used by persons who have had suitable
psychological training; or are employed and authorized to use them in
their employment by an established school, government agency, or business
enterprise; or use them in connection with a course for the study of
such instruments.


Level C. Tests and aids which require substantial understanding of
testing and supporting psychological topics, together with supervised
experience in the use of these devices. (Examples: clinical tests of
intelligence, and personality tests.)

Such tests and aids should be used only by Diplomates of the American
Board of Examiners in Professional Psychology; or persons with at
least a master’s degree in psychology and at least one year of properly
supervised experience; or other psychologists who are using tests for
research or self-training purposes with suitable precautions; or graduate
students enrolled in courses requiring the use of such devices under the
supervision of a qualified psychologist; or members of kindred professions
with adequate training in clinical psychological testing; or graduate
students and other professional persons who have had training and
supervised experience in administering and scoring the test in question
and who are working with a person who is qualified to interpret the test.


Being a trained psychologist does not automatically make one a qualified
user of all types of psychological tests. Being qualified as a user of
tests in a specialty such as personnel selection, remedial reading,
vocational and educational counseling, or psychodiagnosis does not
necessarily [...] projective techniques, intelligence tests,
standardized [...]

[...] the name of each purchaser against the directory of diplomates and
similar [...] are sufficient for the tests [...]
ask some qualified psychologist who knows the purchaser (e.g., one of his
former professors, or his clinical supervisor) to endorse his request. The
publisher evaluates this information and authorizes the person to purchase
tests up to a certain level. Because such investigations are costly, some of
the smaller publishers have made no effective effort to control sales of
their tests.

The ethical responsibility for restricting tests rests on the purchaser as
much as on the distributor. A person who uses a test for which his training is
insufficient runs the risk of making serious errors. It is essential that every
tester evaluate his own qualifications (discussing them with a better-trained
person if he is in doubt) and decide what tests he is ready to use. Ideally,
professional workers would restrict their own testing by self-control, so that
the publisher would have to concern himself only with nonprofessionals such
as employers who believe that anyone can apply personality tests, parents
who want to test their children’s intelligence, or job applicants who want to
practice for tests they may be asked to take.

1. Sometimes a tester relies on the distributor’s judgment, thinking like this: "I'm not
sure whether I'm qualified to use this test. I'll order it, describing my training
honestly; then if the publisher sells it to me, I will know that I'm qualified." What
is wrong with this attitude?

2. An employer without psychological training decides to buy personality tests and
use them on applicants. What is gained by refusing to sell him the tests, in view
of the fact that without them he will base his judgments entirely on superficial
impressions gained through an interview?

3. Examine two or three publishers' catalogs to see what statements are made
about restriction of sale. Are the restrictions uniform? Do they follow the APA
code exactly?

4. Classify the following tests according to the levels of the APA code.

a. A mechanical aptitude test requires the person to assemble simple objects
(e.g., a mousetrap) as fast as possible.

b. The Strong Vocational Interest Blank is an objectively scored questionnaire.

c. A test of arithmetic computation is intended for screening store clerks,
cashiers, and similar employees.

d. A diagnostic oral reading test calls for careful observation of the pupil's
errors, self-confidence, method of attacking unfamiliar words, etc.

5. What is meant in the code by the phrase "with suitable precautions"?

6. The code does not authorize distribution of tests to people who wish to assess
their own aptitudes, skills, or personality characteristics. What are the reasons
for this policy?

7. Most American tests are distributed through publishers to anyone who is
qualified and wishes to buy them. Another system is found in various national
employment services and youth agencies, especially in Europe. Each counseling
service devises a special set of aptitude tests for its own use. Only the counselors
employed by this agency are allowed to use the tests. What are the advantages
and disadvantages of this type of control, compared with the usual type of
distribution?





Sources of Information About Tests 

A first step in looking for tests is to consult the catalogs of major test
publishers. Except for a few tests obtainable only from smaller firms, the
important tests are distributed in the United States by five companies:
California Test Bureau, Educational Testing Service, Psychological Corporation,
Science Research Associates, and World Book Company. The person needing
to purchase tests should therefore obtain the current catalogs of these
firms, and of other publishers likely to have tests in his field of interest.⁴


(a) Mechanical Comprehension Tests

George K. Bennett, et al.

Designed to measure ability to understand mechanical relationships, these
tests consist of drawings with simply phrased questions about them. The
effects of special environment and of rote memory of physical laws are
minimized. Useful in selecting personnel for mechanical work and for
selection of students for technical and engineering training.

Form AA has norms for a large variety of school and industrial groups.
Appropriate for general population testing; for more highly selected groups,
use Forms BB or CC.

Form AA-F is identical with Form AA except that instructions and questions
are in both English and French, for use with French-Canadians.

Form AA-S is the Spanish language edition of Form AA; preliminary norms
from Cuba.

Form BB is more difficult than Form AA. Norms based on ten groups of
students, applicants, and employed technicians and engineers.

Form BB-S is identical with Form BB except that instructions and questions
are in Spanish; preliminary norms from Venezuela.

Form CC (Owens-Bennett) is more difficult than Form BB and yields a wider
range of scores at high ability levels. Norms are based on engineering
students.

Form W1 (Bennett-Fry) is the women's form of this series. Norms are based on
high school freshmen and senior girls and several occupational groups of
women. Difficulty level is between AA and BB.

High school and above. Time: no limit, about 30 min. Arranged with the test
in reusable booklets and with separate IBM answer sheets, which may be
scored either by hand or by machine.

Order booklets and answer sheets separately, specifying form and quantity
of each.

Booklets, sold in packages of 25 with manual and scoring stencils:

1-9 packages, $4.50 each

10 or more packages, $[illegible] each

Single copies, 25 cents each

Answer Sheets. Specify Form AA, BB, CC, W1, AA-S or BB-S (AA-F uses regular
AA answer sheet). Sold only in packages of 50, $1.90 each, and packages of
500, $16.00 each.

Specimen Set, 50 cents. Specify form desired.

Spanish forms AA-S and BB-S together in one specimen set, $1.00.

The catalog lists and describes tests. Most of the catalogs indicate clearly
what level of training is required to use each test, and who may purchase it.
The publisher's recommendation should be viewed conservatively; in some
instances the publisher indicates that a test can be used by a purchaser
with limited training, even though testing authorities would favor a stricter
standard.

Just what information the catalog itself can provide is illustrated by the
excerpt above describing the Bennett tests, which we shall discuss fully in
Chapter 3 and subsequently.⁵ (The first symbol in the excerpt is a code letter
(a), indicating that this falls in the least restricted category of tests. Any
recognized business or industrial firm may purchase this test for use in
personnel selection, even if there is no qualified psychologist on its staff.)

Tests may be suggested by several additional sources, particularly the
Mental Measurements Yearbooks (see p. 101).

⁴ Addresses of publishers are given in the Appendix.

Before a decision to purchase the test for use is made, a detailed study of
its manual is needed. Whereas the catalog description is only a paragraph
long, the manual offers several pages of information on the purposes to
which the test is best suited, methods of administering and interpreting it,
and its limitations. Sometimes the part of this information which describes
the research basis of the test is placed in a technical handbook, leaving the
less technical description for the examiner's manual. If the manual is divided
in this way, both parts should be consulted. A “specimen set” of a test is a
package including a manual, test booklet, and scoring key. Most universities
and many school systems and counseling centers maintain collections of
specimen sets for the use of students and professional staff. In addition,
specimen sets may be purchased directly from the publisher.


Suggested Readings 

Benton, Arthur L. Cerebral disease in a child. In Arthur Burton & Robert E. Harris,
Clinical studies of personality. New York: Harper, 1955. Pp. 600-611.

A difficult problem in psychodiagnostics is simply presented. A 9-year-old
was referred because of emotional and school problems. Test performance on
the Stanford-Binet scale and on special drawing tests showed great
variability in mental functioning. Interpretation of the performance led to a
diagnosis of brain disease, subsequently confirmed by an operation.

Crutchfield, Richard S. Conformity and character. Amer. Psychologist, 1955, 10,
191-198. (Reprinted in Don E. Dulany, Jr., & others, Contributions to modern
psychology. New York: Oxford University Press, 1958. Pp. 293-307.)

In an illustration of the use of test procedures to advance scientific knowledge,
an experimental test of readiness to conform to the opinion of one’s group is
described. Results show the relation of this tendency to personality and to
the nature of the group.

Lawson, Douglas E. Need for safeguarding the field of intelligence testing.
J. educ. Psychol., 1944, 35, 240-247.

Errors are made when teachers with inadequate training administer or
interpret tests of mental ability.

Ogg, Elizabeth. Psychologists in action. Public Affairs Pamphlet No. 229, 1955.

This description of the roles psychologists play is written for laymen and for
those considering careers in the field.

Super, Donald E. A case study in exploration: curricular and occupational. The
psychology of careers. New York: Harper, 1957. Pp. 92-100.

A typical problem in counseling an adolescent girl who is uncertain about
possible careers is described. Information from aptitude, interest, and
achievement tests is combined with the girl’s own statements and her school
record to help her to a greater degree of self-understanding.

Swanson, Wendell M., & Lindgren, Eugene. The use of psychological tests in
industry. Personnel Psychol., 1952, 5, 19-23.

A questionnaire survey of firms in Minneapolis and St. Paul gives a realistic
summary of the testing programs in use for selecting employees.

⁵ From the 1959-1960 catalog of the Psychological Corporation.


Purposes and Types of Tests


DECISIONS FOR WHICH TESTS ARE USED 

ANYONE who works with people is continually making decisions. A personnel
manager decides whom to hire; a teacher decides whether each pupil
is ready for long division; a physician decides how a patient should be
treated. If the decision maker obtains better information before making his
decision he will have a better chance of attaining the results he desires.

All decisions involve prediction. Any test tells about some difference
among people’s performances at this moment. That fact would not be worth
knowing if one could not then predict that these people will differ in some
other performance or in the same performance at some other time.

Consider a test of visual recognition. We flash a row of letters on the screen
for an instant, and the person reports what he has seen. Some people
recognize four letters; others grasp seven in the same brief interval. This
difference is intriguing, but it is unimportant until it can be related to some
other behavior. The applied psychologist sees that this task possibly has
something in common with airplane recognition and with perception in reading.
He investigates whether the flash-recognition test will predict success in
these practical activities. If so, it can assist the armed forces to select
lookouts, or help the primary grade teacher to plan reading instruction.

Prediction is involved in clinical use of tests also. A clinician might use
the flash technique to see whether a person has especial difficulty in
perceiving emotionally toned words like guilt and failure, that being a
possible indicator of emotional disturbance. Such a test is useful only if the
unusual score foreshadows deviant behavior at some time in the future. The
clinician would not need to detect emotional maladjustment if that were only
an internal condition which could never crop out. The significance of the
clinical test hinges on the fact that certain responses permit one to predict
behavior which should be forestalled or encouraged.

The scientific investigator may not care whether the tests he uses have
value for practical decisions. He may not even be interested in individual
differences. But he, too, must have tests which predict. The flash test is a
good laboratory measuring instrument because its scores are stable. If
conditions are not altered, a person makes about the same score each time he
is tested; thus today’s test predicts tomorrow’s score. If the score changes
when the experimenter changes the illumination, we know that the change
resulted from the illumination and not from chance variation. The
experimenter therefore can study systematically how flash perception is
related to illumination. When this relation is fully understood, he has a
general law which predicts what changes in perception will accompany changes
in illumination. If the test were not able to predict tomorrow’s performance
from today’s (other things being equal), it would be of no use to the
experimental psychologist.
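
The claim that today's score predicts tomorrow's can be checked with a simple
correlation between two testings. The following is only a minimal sketch in
Python: the six pairs of flash-recognition scores are hypothetical, and the
function name pearson_r is an illustrative choice, not something taken from
the studies discussed here. A value near 1.0 would indicate stable scores.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 scores)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary within each list")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: letters recognized per flash by the same six persons,
# tested on two days under unchanged conditions.
today = [4, 5, 7, 6, 4, 7]
tomorrow = [4, 6, 7, 6, 5, 7]

stability = pearson_r(today, tomorrow)
print(f"test-retest correlation: {stability:.2f}")
```

With unstable scores the same computation would give a value closer to zero,
and today's result would tell little about tomorrow's.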

1. Demonstrate that prediction is intended in each of the following situations:
a. A foreman is asked to rate his workers on quality of work.

b. Airlines require a periodic physical examination of pilots.

c. A psychologist investigates whether students are more "liberal" in their
attitudes toward birth control after two years of college study.

d. A teacher gives James a grade of C in algebra and Harry a grade of A.

2. Tests are used to obtain information which will permit sounder decisions. Does
this statement apply to the Gallup public opinion poll?

Selection 

Tests aid in making many sorts of decisions, including selection and
classification of individuals, evaluation of educational or treatment procedures,
and acceptance or rejection of scientific hypotheses. We shall consider briefly
each of these types of decision, beginning with selection.

In a selection decision, an institution decides to accept some men and to
reject others. Hiring an employee is such a selection decision. The
distinguishing feature of the selection decision is that some men are rejected,
and their future performance is of no concern to the institution. A person may
be “selected” and “classified” at the same time.

Classification

In classification, we decide which of many possible assignments or
treatments a person shall receive. Examples: The college student asks a
counselor to help him choose the best curriculum. The Navy tests each recruit
to determine whether he should be assigned to the engine room, the chartroom,
or the gun turret. The schoolboy who reads poorly is given a series of tests to
determine what method of remedial instruction he needs, and whether he
should first have some other treatment (eyeglasses, psychotherapy, etc.).

One important classification problem is diagnosis of mental patients. This
may seem like an attempt merely to find the right name for a patient’s
disorder, but it really is a choice among treatments, since the patient’s label
determines what treatment he gets.

Where people are assigned to different levels of work (rather than to
distinctly different types of work) we have a placement decision. Placement is
a special case of classification. “Placement tests” are used to allocate college
freshmen to the proper section of English, i.e., to the appropriate treatment.
Choosing officer candidates from among enlisted men is a placement decision
rather than a selection decision, since the men not chosen as officers
remain in the Army and are used in a different way.

A sharp distinction between classification decisions and selection decisions
is required because a test which is useful in making one type of decision
may not help with the other (Cronbach and Gleser, 1957). A test which
detects serious emotional disturbances would be very useful in keeping
unstable men out of the Army (selection). The test might not help at all, on the
other hand, in deciding how to treat men who break down in the service
(classification). As we shall see in Chapter 12, one interprets validity data
quite differently for classification and selection purposes.

Testing often leads to a description of the person, which can be far more
individualized than a simpler classification. For instance, a test battery plus
other facts might classify a student as a promising engineer, and this would
lead him to a decision to enroll in engineering. A description would report in
addition the many particular assets and liabilities that distinguish this
student from other prospective engineers: he is especially interested in
aviation, he has a rather immature and uncooperative attitude toward superiors,
he works energetically in short bursts, with no long-range scheduling. All
these facts are useful to the counselor. Each one bears on a different decision:
about course planning, about disciplinary treatment, about advice on study,
and so on.

When a test is used descriptively, we do not confine ourselves to one
definite question. Rather, we try to record all important facts so that they
will be available when questions about treatment arise. A description may
catalog a student’s interests, describe his personality pattern, or give an
inventory of his knowledge about his major field. The description is
multidimensional and helps us resolve many different questions about how to
treat the person.

Evaluation of Treatments 

So far we have considered only decisions about individuals. Tests are
equally important as an aid in evaluating treatments. When the teacher
gives an arithmetic test, he is testing his instruction as much as he is testing
the students’ effort and ability. If the results are poor, he should probably
alter his method. When more than one instructional method is under
consideration, an experimental comparison can be made; a test shows which
method gives the best results and should be used hereafter.

In industry, questions about treatment or management can be decided by
suitable tests. The effectiveness of training is judged by performance tests.
Supervision and personnel policies can be judged by tests of attitudes and
morale.


Verification of Scientific Hypotheses 

The functions discussed above illustrate the usefulness of tests in making
decisions of immediate practical importance. Tests are also used extensively
to measure outcomes of scientific experiments, as was illustrated in our
earlier discussion of the measurement of flash perception. The experimenter is
not making decisions about particular individuals. He is trying to decide
whether to accept or reject a particular hypothesis (such as, "The change of
perceptual span with change in illumination is greater when a subject is
under stress”). Tests provide a more objective and dependable basis for
comparisons than do rough impressions.

Sometimes the investigator uses tests published for practical purposes, but
a test tailor-made to fit the experiment will often work better. In one study,
for example, the experimenter played phonograph recordings of words
backwards, in order to study how people learn to recognize strange stimuli
(Lewis, 1946). Such a task, just because it is novel, makes a very good
experimental test.

3. Show that a reading test might sometimes be used by college counselors or ad¬ 
ministrators for each of the four types of decision listed above. 

4. Classify each of the following according to the type of decision represented.

a. A foundling home measures intelligence of a child and uses this as a basis
for deciding which home to place the child in.

b. An instructor rides with a pilot at the end of his training, and fills out a
checklist to show which maneuvers he performs correctly.

c. A psychologist compares the average intelligence of only children with that
of children from larger families of similar social background.

d. All applicants for a driver's license are tested.

e. A test is given in a junior high school for the purpose of identifying
adolescents likely to become delinquent.

f. A university class is divided in two parts, one of which sees the lectures and
demonstrations by television, while the other hears and sees the instructor
directly. Both groups are given the same examination.

5. Education and psychotherapy are both learning experiences, yet tests are used
much more often for routine evaluation in school than in therapy. What reasons
can you suggest?

6. Describe one circumstance where tests might be used descriptively by
a. An employment manager,
b. A social worker dealing with children,
c. A teacher of typewriting.

7. When tests are used to obtain a description, it can be said that a classification
decision is being made. Explain.

WHAT IS A TEST? 

The layman is likely to think of a test as a series of questions requiring a
written or oral answer. Psychological tests are, however, extremely varied, and
the variety is steadily growing. Perhaps the best definition to cover the range
of tests described in this book is as follows: a test is a systematic procedure
for comparing the behavior of two or more persons. We shall not give
attention to unsystematic, spur-of-the-moment procedures for sizing up a person
(casual conversation, for example).

We shall examine a large number of principles regarding tests, and a large
number of criteria for deciding whether a test is satisfactory. Perhaps we
should define test so as to include all the procedures to which these criteria
and principles apply. If we did this, however, we would have to extend the
definition to cover measures of animal behavior and measures of
nonbehavioral characteristics. For example, to determine how atomic radiation
affects behavior of animals, it is necessary to measure their activity before
and after, and the procedure has to satisfy the same logical requirements as
does any test of human behavior. In one study (Isaac and Ruch, 1956), the
investigators believed that spontaneous movement of monkeys would be
affected by radiation. To measure this effect, they tried four techniques:
rating by an observer, recording from a photocell pointed across the cage,
and two methods of recording the movements of the cage floor, which was
suspended so that it vibrated when the animal moved. Determining the
best technique is just like choosing among educational and clinical tests;
the experimenters had to apply the very indices of reliability and test
intercorrelation which we shall study in later chapters. Thus, while this book
is most concerned with tests used to study differences between people, much
of the material is significant for the animal experimenter, for the sociologist
comparing communities, or for any other behavioral scientist.

Our definition includes measurements using apparatus, laboratory
procedures for observation of social responses, questionnaires for obtaining
reports on personality, and systematic records collected on an industrial
production line. The reader is warned, however, that many definitions of test
are in current use, varying with the writer’s purpose. Some writers restrict
the word test to measuring instruments, but we shall not. A true measuring
instrument is supposed to assign to every person a number which locates him
on a scale of equal units, as we do when we report height in inches. Not only
do psychological tests give less perfect measurements, in this sense, than do
instruments used in other sciences, but many useful devices do not measure
at all. In particular, some personality tests yield a verbal description instead
of summing up the person by means of scores.

Standardization 

A distinction between standardized and unstandardized procedures grew
up in the early days of testing. Every laboratory in those days had its own
method of measuring memory span, reaction time, and so on, and it was
difficult to compare results from different laboratories. It was likewise
difficult for school officials to answer such practical questions as whether
pupils were learning to spell as well as could be expected, when every teacher
used a different test. Standardized tests were designed to overcome these
problems. A standardized test is one in which the procedure, apparatus, and
scoring have been fixed so that precisely the same test can be given at
different times and places.

Some tests are provided with tables of norms stating what scores are
usually earned by representative subjects. Tests having such norms are
sometimes called “standardized tests,” and the process of gathering norm data is
called “standardization.” We are not using the word standardized in that
sense, because we wish to emphasize standardization of procedure. A test
may have a table of norms even though its procedures are not clearly
specified, and a test with well-standardized procedures may not have norms.
Obviously, collecting norms is not profitable until procedures are well
standardized.

The first major step toward standardization of psychological testing came
in 1905, when a committee of the American Psychological Association
defined procedures (e.g., for testing memory) which could be followed in all
laboratories. Today, most of the published tests with which American applied
psychologists and teachers operate are carefully standardized. In personality
assessment, however, a number of quite unstandardized procedures
are in general use.

Standardization has a place in all research. In experimental psychology,
standardization is not yet as well accepted as in testing, but the need for
standardized procedures is much the same. These remarks of Underwood
and Richardson (1956, p. 84) regarding concept-formation experiments give
arguments for standardization which apply equally well to tests.

. . . tasks or materials which have been used are quite diverse in nature.
With few exceptions (e.g., Weigl-type card sorting) no systematic
series of experiments has been built around a single task. While this lack
of task standardization attests to the ingenuity of individual workers in
constructing new materials, the situation may not be entirely satisfactory
for efficient development of laws and theories. In the more highly-developed
areas in psychology only a few basic tasks, procedures, or
materials have been used. Thus, classical conditioning, the Skinner box,
nonsense syllables, the pursuit rotor (to mention a few) all have had
widespread use. While some may justifiably raise questions concerning
generality of findings based on such a limited number of procedures
and tasks, it cannot be doubted that interlaboratory communication and
continuity is greatly facilitated by the use of common basic tasks and
procedures.

Tests vary in the completeness with which they are standardized. Printing
the questions and mass-producing the equipment assures uniformity in those
respects, but the directions to the subject are not always worked out in
complete detail. Every condition which affects performance must be specified if
the test is to be regarded as truly standardized. Thus for a test of
color-matching ability, one needs to use uniform color specimens, to follow
uniform directions for administration and scoring, and also to use precisely
the right amount and kind of illumination. If standardization of the test
were fully effective, a man would earn very nearly the same score no matter
who tested him or where. There are, however, many difficulties in completely
standardizing the tester's procedure and the subject’s attitude, some of which
will be discussed in Chapter 3.

Objectivity 

Tests vary in their degree of objectivity. A fully objective test is one in
which every observer or judge seeing a performance arrives at precisely the
same report. To do this, he must pay attention to the same aspects of the
performance, record his observations to eliminate errors of recall, and score
the record by the same rules. The objectivity of the procedure may be
judged by the degree of agreement between the final scores assigned by two
independent observers. The more subjective the observation and evaluation,
the less the two judges agree.
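
The degree of agreement between two independent observers can be put in
numerical form. The following is only a minimal sketch under stated
assumptions: the function names and the sample scores are hypothetical, and
the two indices shown (proportion of identical scores, and the product-moment
correlation between the two sets of scores) are common choices rather than a
procedure prescribed here.

```python
from statistics import mean


def percent_agreement(scores_a, scores_b):
    """Proportion of performances on which two judges assign the identical score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both judges must score the same non-empty set of performances")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


def pearson_r(scores_a, scores_b):
    """Product-moment correlation between the two judges' scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired scores")
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    ss_a = sum((a - mean_a) ** 2 for a in scores_a)
    ss_b = sum((b - mean_b) ** 2 for b in scores_b)
    if ss_a == 0 or ss_b == 0:
        raise ValueError("correlation is undefined when a judge gives everyone the same score")
    return cross / (ss_a * ss_b) ** 0.5


# Hypothetical scores given by two judges to the same five performances.
judge_a = [7, 5, 9, 4, 6]
judge_b = [7, 6, 9, 4, 5]
print(percent_agreement(judge_a, judge_b))
print(round(pearson_r(judge_a, judge_b), 3))
```

A fully objective procedure would yield complete agreement on both indices;
the more subjective the scoring, the lower both values fall.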

Tests in which the subject selects the best of several alternative answers
(e.g., true-false, multiple-choice) are referred to as “objective tests,”
because all scorers can apply a scoring key and agree perfectly on the result.
In contrast, an ordinary essay test allows room for great disagreement among
scorers. By careful instructions to the observer or scorer, free-response tests
and observations can be made fairly objective.
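
Why a scoring key produces perfect agreement among scorers can be seen from
a small sketch. The item numbers, keyed answers, and function name below are
hypothetical; the point is only that the score follows mechanically from the
key, so that any scorer applying it must reach the same number.

```python
def score_with_key(responses, key):
    """Count the items on which the subject's response matches the keyed answer.

    Omitted items (absent from `responses`) earn no credit.
    """
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)


# Hypothetical four-item key: two true-false items, two multiple-choice items.
key = {1: "T", 2: "F", 3: "b", 4: "d"}

# One subject's answer sheet; item 4 was omitted.
responses = {1: "T", 2: "T", 3: "b"}

print(score_with_key(responses, key))
```

No judgment enters at any step, which is exactly what an essay test cannot
guarantee.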

8. Judge each of these statements true or false and defend your answer:
a. Batting averages are objectively determined.
b. The 220-yard low hurdle race is a standardized test.
c. A teacher has each member of the class read the same article in a current
magazine. Time is called at the end of three minutes, and each pupil marks
the place where he is reading. He then counts the number of words read
and computes his reading rate in words-per-minute. This score is compared
with a table of average reading speeds for typical magazine articles. This
test is highly objective.
d. The test described in c is standardized.

9. Psychological tests often start from very crude procedures. Psychologist X thinks
that he obtains useful information by laying a sheet of paper on the table at
arm's length from his subject and asking him to touch with his pencil exactly
in the center of a circle printed on the paper. The subject is told to withdraw
his hand and repeat the movement, as rapidly and accurately as possible, until
he is told to stop. Psychologist X gives the man a mark from 1 to 10 on each
of the following qualities: speed, carefulness, and persistence.

a. What changes would improve the objectivity of the test? 

b. What aspects of the procedure would need to be taken into account in 
standardizing the test? 

10. Industrial morale surveys often use questions made up by the plant personnel 
office or its consultants. What advantages and disadvantages would there be 
in using the same standardized questions in many different plants? 

11. The Kohs Block Design Test (see Figure 5, p. 42) is one of the most popular
testing procedures. The subject is required to construct a pattern from colored
blocks to match a printed sample. The test is chiefly used in child guidance,
clinical diagnosis, and measurement of intelligence of persons who do poorly
on verbal tests. It is also used for research on frustration and on cultural
differences. At least twenty versions of the test (different items, different
scoring rules, etc.) are used in different clinics and different countries.
What are the possible advantages and disadvantages of this diversity?

12. The Kohs test was first published as a long series of carefully chosen items.
Why do you think so many different versions now exist in different countries, 
even though the test is used for the same purpose in these places? 


Psychometric and Impressionistic Testing

There are two philosophies of testing, growing from different historical
roots and fostering different types of test procedure and interpretation. Both
are mingled in contemporary practice. While we cannot discuss these different
approaches exhaustively, especially in this introductory chapter, we
can survey the main characteristics of each.

Psychometric testing obtains numerical estimates of single aspects of
performance. Its ideal is expressed in the famous dicta of E. L. Thorndike that
“If a thing exists, it exists in some amount,” and “If it exists in some amount,
it can be measured.” One can observe in this statement a hidden assumption
that the psychologist is concerned with “things,” i.e., with distinct elements
or traits which have a real existence. All people are considered to possess
the same traits (e.g., intelligence, or mechanical experience), but in different
amounts. This view of psychological investigation takes its cue from
physical science, which identifies common aspects of dissimilar objects and
describes any object by numbers representing such abstract dimensions as
weight, volume, and intensity of energy of a certain wave length.

The second approach leads to a comprehensive descriptive picture of the
individual. We shall refer to this style of investigation as impressionistic.
Impressionistic psychologists think that understanding another person
requires a sensitive observer who looks for significant cues by any available
means and integrates them into a total impression. Studying one trait or
element at a time is, in their view, no substitute for considering the person
as a whole. The impressionist is not satisfied with knowing “how much” of some
ability the person has; he asks how the subject expresses his ability, what
kinds of errors he makes, and why (see Ban on, 1937).

To evaluate a subject's background, for example, a psychometric tester
would have him respond to a biographical checklist covering experiences
which many people have and which are likely to be important in their
development. (For example: “Were you a Boy Scout patrol leader?”) He
would score responses objectively by counting the number of items checked
in such categories as “Interest in sports” and “Leadership experience.” The
impressionist, on the other hand, would ask for an autobiographical essay,
perhaps setting no more definite task than “Please write your life story on
these pages.” From the response, he could see what the subject considers
important about himself, what emotional tone he uses to describe his past, and
what unique experiences he has had, experiences the checklist would not
cover. The free response may give little information on important areas
covered thoroughly by the checklist, but it covers matters the checklist ignores.
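
The objective scoring of such a checklist amounts to a simple tally. The
sketch below is illustrative only: the items, their assignment to
categories, and the function name are all hypothetical, and an actual
checklist would be far longer.

```python
# Hypothetical checklist: each item belongs to exactly one scoring category.
CATEGORY_OF_ITEM = {
    "Boy Scout patrol leader": "Leadership experience",
    "Class officer": "Leadership experience",
    "Played on a school team": "Interest in sports",
    "Follows league standings": "Interest in sports",
}


def checklist_scores(checked_items):
    """Return the number of checked items in each category.

    Every category appears in the result, even with a count of zero.
    An item that is not on the checklist raises KeyError.
    """
    scores = {category: 0 for category in CATEGORY_OF_ITEM.values()}
    for item in checked_items:
        scores[CATEGORY_OF_ITEM[item]] += 1
    return scores


checked = ["Boy Scout patrol leader", "Played on a school team", "Follows league standings"]
print(checklist_scores(checked))
```

Note what the tally cannot record: anything the subject would have written
that is not among the listed items. That is the information the essay
supplies.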

Each approach has merit, and each has its special limitations. Both have
contributed to the development of present practice, and neither style can be
adopted to the exclusion of the other. The measurer must fall back upon
judgment whenever he applies information from scores in teaching, therapy,
or supervision of employees, and the portraitist cannot ignore the accurate
facts psychometric instruments provide. There are several differences
between the psychometric and impressionistic schools; a particular testing
procedure may follow one school on one point and another on the next. The
styles differ with respect to definiteness of tasks employed, control of
response, objective recording of basic data, formal numerical scoring and
numerical combination of data to reach decisions, and critical validation of
interpretations.

Definiteness of Task. The test designer decides how definitely the task
is to be explained to the subject. In some tests, such as the biographical essay
mentioned above, the subject is free to employ any style and any content he
chooses. On the other hand, a questionnaire in which the subject is to check
each activity he has engaged in during the past five years leaves little or no
room for individual interpretation.

A test is said to be structured when all subjects interpret the task in the
same way. The more latitude allowed, the less structured the test is. Of
special interest are projective tests, which ask the subject to interpret a
stimulus that has no obvious meaning. For instance, he may be shown an inkblot
and told to report what it looks like to him. If he asks how many ideas to
report, whether to use the same portion of the blot in two ideas, or any other
such question, he is told, “That’s up to you.”

Structuring the task controls the performance so that all subjects are
judged on very much the same basis. It therefore permits a definite answer
to a question formulated in advance (e.g., how much experience with small
boats has the subject had?). The less structured technique allows greater
variation in responses and in that sense reveals more individualized
response patterns. (The subject’s essay may, for example, give information on
some unusual interest, such as training dogs for show, but may tell nothing
about boating experience.)

Recognition vs. Free Response. Most tests can be designed either in a free
response or in a recognition form, which allows greater control of
responses and makes scoring less impressionistic. In a mental test,
series-completion items (7 5 8 6 9 ____) and verbal analogies (wolf is to cub
as cat is to ____) may be left in free-response form, or the subject may be
offered alternative answers from which to choose.

The psychometric tester generally prefers the recognition test because it
can be more objectively scored, does not depend on fluency or expressive
skill, and is less subject to misinterpretation of questions than the
free-response form. Many testers, however, prefer the free-response form. The
most important reason is that the free response permits observations which
illuminate the scored aspect of performance. If a student writes out a
long-division problem, for example, the tester can judge his neatness and his
organization of work, and perhaps can base diagnostic conclusions on the
errors he commits.

Product vs. Process. A principal difference between psychometric and
impressionistic testing is that the former concerns itself with the tangible
product of performance: the answer given, the block tower constructed,
or the essay written. When a psychometric tester does pay attention to the
process of performance, he arms himself with a record sheet for tabulating
what the subject does. The impressionistic tester, however, watches the
subject at work in order to form a general opinion; this general impression is
indeed the basic datum with which the psychologist works. In describing their
military testing during World War II, “German psychologists [who carry the
impressionistic style to extremes] stated repeatedly that observations of the
candidate's behavior during a test were more important than the actual score
which he earned. . . . One man . . . said that the chief fault of
inexperienced military psychologists was that they attached too much weight to
objective scores and did not pay enough attention to the formation of an
intuitive impression from observation of the candidate’s reactions and
expressions. Individual examiners were permitted and often encouraged to
vary testing procedures and to emphasize their favorite tests” (Fitts, 1946).

Analysis of Results. It follows that formal scoring plays a large part in the
psychometric test and a very minor part in the work of the impressionistic
tester. American devotion to the numerical score sometimes goes to such
extremes that a tester reports nothing about a child but the IQ calculated for
him, discarding all the other information obtained in an hour of close
observation. The thoroughly impressionistic tester may in his turn translate a
test performance into a character description without ever counting up a score.
Preferably, in individual testing, both scores and descriptive information are
taken into account.

When a decision is to be made, one can apply some formal rule to the
various facts or can combine them impressionistically. For example, a teacher
may assign a course grade by strictly averaging the tests, or may form an
overall impression that this student is “doing B work even if he did slump
at the end” and that one is “not really as good as his tests suggest.” The
psychometric tester tends to prefer the impersonal procedure, while the
impressionist thinks an informal method is more flexible and realistic.
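
The “formal rule” in the teacher's case can be written out completely, which
is what makes it impersonal. The sketch below assumes a hypothetical grading
scale (the cutoffs are invented for illustration) and a hypothetical function
name; it shows only that, once the rule is fixed, the grade follows from the
scores with no room for an overall impression.

```python
from statistics import mean

# Hypothetical cutoffs: the lowest average that earns each letter grade.
GRADE_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def course_grade(test_scores):
    """Assign a letter grade by strictly averaging the test scores."""
    average = mean(test_scores)
    for cutoff, letter in GRADE_CUTOFFS:
        if average >= cutoff:
            return letter
    return "F"


# A student who slumped on the last test: the rule sees only the average.
print(course_grade([88, 86, 84, 70]))
```

The impressionistic teacher might weigh the final slump differently for
different students; the rule treats every student with these four scores
alike.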

The psychometric tester’s insistence on numerical scores influences his
choice of tests. Some testers bombard the subject with one test after another,
seeming to have almost a mystical faith that the accumulation of numbers
will provide all the information needed to solve his problems. In this
concentration on measurable variables the tester may ignore equally pertinent
aspects of the individual for which no scorable instruments have been
developed. It is easy, in child guidance, to obtain measures of ability, and
fairly adequate instruments exist for obtaining an “emotional adjustment”
score. Those scores, however, tell only a small part of the story, and the
psychologist should certainly go on to investigate the child's image of his
mother, his father, and his teacher, and what activities in his life give him
the greatest satisfaction, even if none of these questions can be answered by
a number on a scale, or taken into a statistical formula.

Emphasis on Critical Validation. Finally, we come to the question of critical
validation. Psychometric [illegible] on tests and observations. Ideally, a
psychometric tester accompanies every numerical score with a warning regarding
the error of measurement, and every prediction with an index showing the
probability of its coming true. The impressionist is less likely to carry out
formal validation studies, often being satisfied to compare impressions based
on one procedure with impressions [illegible]. Validation of qualitative
interpretations and [illegible] than validation of scores and requires a
greater readiness for self-criticism on the part of the psychologist.

The most critical issue, indeed, between psychometric and impressionistic
testing is that of confidence in the psychologist. Those who develop and
advocate rigorous psychometric procedures regard the tester as a source of
bias tending to obscure the truth. Those who prefer less structured
procedures regard the observer as a sensitive and even indispensable instrument.
The impressionist does not deny the danger of bias and random error in
judgment. He, however, fears that narrowing one’s focus to what can be
represented in a numerical score on a standard procedure throws away most of
the psychologically important information. The gains from intuitive
observation and interpretation, he believes, more than offset the errors it
introduces.

Most testers occupy an intermediate position, intermediate between
obsession with scores and unrestrained use of intuition. Formal, strictly
objective procedures are normally combined in some manner with judgment,
everywhere save in mass classification programs such as military processing.

The impressionistic style assigns great responsibility to the test
interpreter. He must be an artist, sensitive to observe and skillful to convey
his impressions. Some psychologists are presumably much better judges of
personality than others. The psychometric method seeks procedures which
everyone can use equally well. The objective test is a camera pointed in a
fixed direction; every competent photographer should get the same picture
with it. Thus psychometric testing aims to reduce analysis of individual
differences to a routine technical procedure. To the extent that it succeeds, it
reduces the need for an authoritative, “wise” professional psychologist. A
similar conflict between the technical and the artistic ideal is found in
medicine. Laboratory tests assume more and more of the burden of medical
diagnosis, yet doctors have great respect for the legendary genius who
diagnoses unerringly the malady overlooked by the tests.

13. "Psychometric testing trusts the judgment of the test constructor, where it is
unwilling to trust the tester." Is this a defensible statement?

14. Distinguish between structured and standardized.

15. In what respects are the following procedures unstructured?

a. In the Ayres handwriting test, pupils are told to write the Gettysburg
Address neatly, doing as much as they can in a fixed time.

b. In the Draw-a-Man test of mental development, the child is told to "draw
the best man you can."

c. In a recorded pitch-discrimination test, the subject hears two tones and
responds H, L, or N, according as the second tone appears higher than, lower
than, or no different from the first.

16. What are the advantages and disadvantages of the biographical checklist as
compared to the essay?

17. Is the issue between psychometric and impressionistic testers one that can be
settled by suitable factual research?

CLASSIFICATION OF TESTS 

Tests might be classified in many ways: according to form, purpose, content,
and other characteristics. We shall place tests in two classes, the first being
those which seek to measure the maximum performance of the subject. We
use these when we wish to know how well the person can perform at his
best; they may be referred to as tests of ability. The second category includes
those tests which seek to determine his typical performance, i.e., what he
is likely to do in a given situation or in a broad class of situations. Tests of
personality, habits, interests, and character fall in this category, because
characterizations like “shy,” “interested in art,” and “anxious when in
disagreement with a superior” describe the individual's typical behavior.

Tests of Ability 

The distinguishing feature of a test of ability is that the subject is
encouraged to earn the best score he can. An ability is a response subject to
voluntary control (McClelland et al., 1958, p. 206). Naturally, the adequacy
of the test depends upon the degree to which the person is motivated, i.e.,
upon his willingness to demonstrate his ability. The goal of the tester is to
bring out the person's best possible performance.

FIG. 1. Two of the Porteus mazes. The subject is required to trace the correct path through a
maze. A failure is judged whenever his pencil enters a blind alley. He is then given further
trials on the same maze. When he gets one maze correct he goes on to a more difficult one,
continuing until he fails several trials at a particular level. (Copyright 1933, The Psychological
Corporation. Reproduced by permission.)

Notice that we define ability tests in terms of what the tester is trying to
learn rather than by describing the test itself. A test intended to reveal
maximum performance sometimes fails to do so (for example, when the subject
becomes too tense to perform well). Moreover, the same testing procedure
can be used either to measure ability or to study typical performance. For
example, although the Porteus maze (Figure 1) can be scored solely in
terms of speed and correctness, it also permits the tester to observe how
much foresight and planning the subject uses. Since any test performance
depends on both ability and personality, our classification is somewhat
arbitrary.

Some ability tests measure performance on familiar tasks: for example, a
road test for a driver’s license. Others require the person to do something
completely unfamiliar. The Complex Coordination Test requires a person
who has never flown a plane to operate a “stick” and “rudder bar” just
as if he were flying. Flashing lights signal for certain movements. If he
can follow the directions and make the necessary coordinations he gets
a high score. This task reproduces one aspect of the flyer’s job; other
things being equal, a person who is superior on this test will be a
superior pilot.

FIG. 2. The Complex Coordination Test.

Tests measuring maximum performance are referred to as mental
tests, intelligence tests, etc. We shall not define these terms formally;
indeed, most of the terms have no well-established definition. One large
group of tests we shall refer to as measures of general mental ability. They
seek to measure those mental abilities which are valuable in almost any type
of thinking or learning. Tests of this sort are often called “intelligence tests,”
but that name leads to controversy because “intelligence” has so many
meanings. General abilities may be contrasted with the more specialized abilities
which are of value only in a limited range of tasks. Among the specialized
abilities are mechanical comprehension, sense of pitch, and finger dexterity.
There is no widely used name for tests of this sort; we shall refer to them as
measures of special abilities. While a test for a single specialized ability may
be used by itself, it is more common to test several such abilities at once so
as to study the person's ability profile.

A proficiency test measures ability to perform some task which is
significant in its own right: reading French, playing a piano, trouble-shooting an
airplane engine. Since one of the principal uses of such a test is to evaluate
performance of persons who have been given training in the task, these tests
are often referred to as achievement tests.

An aptitude test is one used to predict success in some occupation or
training course; there are tests of engineering aptitude, musical aptitude,
aptitude for algebra, and so on. In form, these tests are not distinctly different
from other types. An engineering aptitude test may include sections
measuring general mental ability, mechanical and spatial reasoning (special
abilities), and proficiency in mathematics. The test is referred to as an
achievement test when it is used primarily to examine the person’s success in past
study, and as an aptitude test when it is used to forecast his success in some
future course or assignment.

Tests of Typical Performance 

Tests of typical performance are used to investigate not what the person
can do but what he does. There is little value in determining how courteous
a girl applying for store employment can be when she tries; almost anyone
of normal upbringing has the ability to be polite. The test of a suitable
employee is whether she maintains that courtesy in her daily work, even when
she is not “on her best behavior.” To take another example, any inspector
with proper vision and training should be able to detect defective parts. A
test which determines how well he spots defects when trying especially hard
would measure vision rather than carefulness. The chief difference between
the good and the poor inspector is that the latter permits himself to be
distracted and careless in run-of-the-mill duty.

For cheerfulness, honesty, open-mindedness, and many other aspects of
behavior, a test of ability has almost no practical value. Most people can
produce a show of the behavior when it is demanded of them. But those
who act cheerfully, honestly, or impartially when they know they are being
tested may not do so in other situations. Typical performance is important
even when we are concerned with aptitude for success. If we are hiring an
executive whose past success guarantees his ability, we also wish to know
how he usually operates. Does he supervise closely, down to the last detail?
Or does he outline a general task and turn his subordinates loose? Is he
equally concerned with production, human problems, and finances? Does he
prefer long-range planning or quick adaptation? Knowing his pattern is
necessary to place him properly in the organization.

In tests of ability a high score is desirable, but in most tests of typical
performance no particular response can be singled out as good. For example,
there is nothing good or bad about interest in engineering. One who has
this interest can use it, but one who does not finds other worth-while
activities. Likewise, people show wide variation in dominance-submission in
social relations. We cannot say that any certain degree of dominance is best,
since our world has places for persons of all types.

The person’s characteristic behavior is our best clue to his personality.
Habits have predictive value in themselves; what a person does once he is
likely to do again. Most psychologists would object, however, to assuming
that a person’s observable habits are his personality. New situations
continually arise, and a description of his customary behavior does not directly
indicate what he will do in a new situation. A boy may have a reputation as a
woman-hater, but some girl will come along who arouses a quite different
response. A clinician who establishes warm relations with most clients will
encounter some who arouse only hostile feelings in him. Because we do not wish
to regard these exceptional reactions as capricious and unexplainable, we
interpret reactions in various situations as reflections of a more basic and
consistent “personality structure.” This structure has to be inferred from
behavior. The psychologist hopes that when he understands the structure he
will be able to predict the person’s responses even to new situations.

Testing of typical performance is difficult. It has been accomplished, with
greater or less success, in a variety of ways. These methods may be divided
into behavior observations and self-report devices.

Behavior Observations. Behavior observations are attempts to study the
subject when he is “acting naturally.” Observations are made both in
standardized test situations and in unstandardized or “natural” conditions.

The standardized observation requires that each subject be placed in
essentially the same situation. Personality may be observed during a mental
test, during a group discussion, or while the subject is walking a rail
blindfolded. Special tasks are often devised which give an especially good
opportunity for observation. These tasks may be referred to as performance tests
of personality.

The standardized observation permits relatively exact comparison of
persons who are not normally seen in similar circumstances. Moreover, it reveals
characteristics which could be seen only occasionally in everyday life. Such
procedures have been used, for example, to observe typical reactions to
frustration. The person commences a task and is prevented in some way from
attaining his goal. The way he reacts gives insight into his emotional control.
In one famous study, preschool children were given the opportunity to play
with ordinary, reasonably interesting toys. Then they were allowed into an
adjoining room with extremely attractive toys. After a period of play in this
room, they were headed back into the first room, and a wire screen was
placed between them and the attractive toys.

As the artist’s representation of this experiment shows (Figure 3), the
children reacted in many ways, pounding on the fence, regressing to simple
play with blocks, trying to pry under the fence, or going off to take a
pretended nap. The observers recorded the children’s behavior, finding that
after frustration their games were less mature and less constructive than
before.

FIG. 3. Varied behavior among children subjected to experimental frustration. (After
a study by Barker and others, 1941; drawing reproduced from Morgan, 1942, p. 249.)

If an observation is to bring to light typical behavior, the subject must not
know what characteristic is being observed. The observer may be concealed,
or the subject may be led to believe that he is being tested on one behavior
when something else is being observed. Thus when reaction to frustration
is being studied, the subject may be told that his mental ability is being
tested. His responses when he is frustrated by difficult questions are usually
genuine and little disguised.

Data on typical behavior may also be gathered by observing samples of
the person’s ordinary daily activities, “in the field,” as it were. Children on
the playground reveal a good deal about their habits and personality; so do
noncoms leading platoons, and workers in the office. Field observations may
use elaborately standardized recording procedures (even sound-motion
pictures) or may, on the contrary, consist merely of an impressionistic
judgment. The baseball batting average is a summary of systematically recorded
field observations. The industrial supervisor’s merit ratings are also based
on observation, but the judgments are almost completely unsystematic.

Self-Report Devices. The subject has had much opportunity to observe
himself. If he is willing, he can give a helpful report of his own typical behavior.
Questionnaires are used to obtain such reports. The crucial problem in
self-report, if it is to be interpreted as a picture of typical behavior, is honesty. If
the person tries to give the best possible picture of himself instead of a true
description, the test will fail of its purpose. Even when he tries to be
truthful, we cannot hope that he is a really detached and impartial observer of
himself. His report is certain to be distorted to some degree.

Most self-report inventories offer a fairly comprehensive picture of
personality. Some of them, however, are specialized in their coverage. There are
study-habit inventories, interest inventories, social attitude inventories, and
so on. Other tests are designated as “adjustment inventories,” “character
tests,” etc.; such a name suggests the way in which the score is to be
interpreted but does not identify a distinctive form of test.

It is generally agreed that personality questionnaires should not use the
word test in their titles (Technical Recommendations, 1954, p. 19). If an
instrument is marketed under the title “The Jones Dominance-Submission
Test,” employers, teachers, or others with limited psychological training may
think that a person’s dominance is being directly measured. If the instrument
merely asks the subject a series of questions about himself, he can describe
himself in any way he likes. A title such as “The Jones Dominance
Questionnaire” or “The Jones Dominance Inventory” is less likely to give an
impression of trustworthiness than “The Jones Dominance Test.” It is desirable
to use some term such as questionnaire whenever the word test might be
misinterpreted.

18. Classify each of the following procedures as a test of ability, a self-report test,
observation in a standardized situation, or observation in an unstandardized
situation.

a. An interviewer from the Gallup poll asks a citizen how he will vote in a
coming election.

b. A television producer wishes to know what program features appeal to
different types of listeners. He presents a show to a small audience, who
press signal buttons to indicate whether they enjoy or dislike what they
are seeing at each moment.

c. A test of "vocational aptitude" asks the subject how well he likes such
activities as selling, woodworking, and chess.

d. A spelling test is given to applicants for a clerical job.

e. Inspectors in plain clothes ride buses to determine whether operators are
obeying the company rules.

f. During an intelligence test, the examiner watches for evidence of
self-confidence or its absence.

g. In a test of "application of principles in social studies," students are told of
a conflict about admitting Negroes to a housing project. They are asked
what the city council should do and to give reasons to support the choice.

h. An inspector in a stocking factory is supposed to detect all stockings with
knitting faults. To check her efficiency, at certain times a number of faulty
stockings which have been marked with fluorescent dye are mixed into the
batch for inspection. The dye is invisible to the worker, but by turning an
ultraviolet lamp onto the stockings after inspection the supervisor can
readily locate the faulty stockings which the inspector missed.


Procedural Terms 

There are a number of miscellaneous terms designating tests according to
their procedure. The meaning of such terms as pencil-and-paper test,
apparatus test, oral test, and so on should be obvious. Although all tests call
for performance of some sort, the name performance test is usually applied to
tests requiring a nonverbal response. Among the performance tests which
have been used for various purposes are repairing a piece of electronic
apparatus, drawing a picture of a man, stringing beads, and “inventing” a
hatrack when given two long sticks and a C-clamp.
Group tests differ from individual tests in that the former permit many
subjects to be tested at once. Group tests can be given to a single individual
if that is desirable. Many individual tests require careful oral questioning or
observation of reactions. Some individual tests can be modified and
simplified to permit group administration. An example is the Rorschach test of
personality. In the individual form of that test, a subject looks at a card
bearing an inkblot and tells what he thinks the blot looks like. He is questioned
about each response until the tester is sure just what the subject sees. In the
group form, the blots are projected onto a screen. Subjects write their
responses, and individual questioning is omitted.

Another meaning for group test has developed in recent years. The term
now often refers to procedures for studying the behavior of the individual in
a group. To observe leadership, initiative, and reaction to opposition, six
persons may be asked to work together to solve a problem. The behavior of
each person is observed.

19. Classify each of the following tests, using as many of the descriptive terms
discussed in the text as are clearly applicable.

a. The Study of Values consists of printed questions, such as:

In your opinion, can a man who works in business all the week best spend
Sunday in

a. trying to educate himself by reading serious books
b. trying to win at golf, or racing
c. going to an orchestral concert
d. hearing a really good sermon

The subject answers each question by checking whichever answer he
prefers. Answers are scored by a numerical key to determine how important
"aesthetic," "religious," and other values are for him.

b. In the Stenquist Mechanical Aptitude Test, the subject marks illustrations of
tools and other objects to show which go together (e.g., hammer and
anvil).

c. A Picture Arrangement Test item presents a set of four pictures which,
arranged in the correct order, tell a story in the manner of a cartoon strip
(see Figure 33). Each picture is on a separate card. The cards are presented
in a random arrangement and the subject arranges them to make an
intelligible story.

d. In a finger dexterity test the subject mounts washers on rivets and places
each one in a hole on a special board, working as rapidly as possible.

20. Classify the procedures used by Miss Kimball, the school psychologist described
in Chapter 1, according to the terms used in this chapter.


Suggested Readings 

Baldwin, Alfred L. The role of an “ability” construct in a theory of behavior. In
David C. McClelland & others, Talent and society. New York: Van Nostrand,
1958. Pp. 195-233.

Baldwin discusses the nature of ability and the theoretical requirements of
ability tests. His argument that only voluntary behavior shows ability, as
distinguished from habit, amplifies our distinction between maximum
performance and typical performance.

Bingham, Walter V. On getting rattled. Personnel Psychol., 1950, 3, lOo-lll.

This article describes some apparatus tests for measuring coordination which,
with slight modification, can be used to observe temperament.

Melton, Arthur W. (ed.) Problems and techniques of mass testing with apparatus.
In Apparatus Tests. Washington: Government Printing Office, 1947. Pp. 22-53.

The aptitude testing program discussed in these reports was the most
elaborate one ever conducted. This chapter shows what had to be taken
into account in standardizing procedures so that men tested in California
could be compared precisely with men tested in Texas.

Munn, Norman L. Intelligence, and The assessment of personality. Psychology
(3rd ed.). Boston: Houghton Mifflin, 1956. Pp. 48-81, 170-181.

An introductory textbook describes a great variety of prominent tests,
giving drawings or photographs of most of them. The chapter on intelligence
also provides considerable information on the nature and growth of intelligence.


Administering Tests


SOME tests are sufficiently simple for any intelligent adult to give
successfully; others are so subtle that months of special training are required before
the tester can do a fully effective job. In general, group tests require less
training to administer than individual tests, although there are some
exceptions. If the tester has no responsibility save to read a set of printed
directions, any conscientious, nonthreatening person should be successful. Where
it is necessary to question the subject individually and to use follow-up
questions if the first answer is unclear, great skill and experience are
required.

The tester must take pains to give every subject a chance to exhibit his
ability, and to obtain results comparable to those of other testers. The
importance of rigorous adherence to prescribed testing procedure is especially
obvious in the great competitive testing programs for scholarship awards
and college admissions. The Scholastic Aptitude Test of the College
Entrance Examination Board, the most prominent example, is given in 1000
American centers and 40 foreign ones. At 9 a.m. on a particular Saturday in
January, the seal is broken on the test package in each center, in Bronxville
and Berkeley and Bell Buckle, Tennessee, in Beirut and Kubasaki and
Kodaikanal. The completed papers pour into the scoring centers and reports
go out to the colleges. A boy tested in Beirut may be in competition with one
from Berkeley for admission to the same college, and the selection procedure
is unfair unless the two are tested in an identical manner.

To assure fair testing, the tester must become thoroughly familiar with the
test. Even a simple test usually presents one or two stumbling blocks which
can be anticipated if the tester studies the manual in advance.

The tester must maintain an impartial and scientific attitude. Testers are
usually keenly interested in the persons they test, and desire to see them do
well. As a result, the beginning tester is tempted to give hints to the subject
or to coax him toward greater effort. It is the duty of the tester to obtain
from each subject the best record he can produce, but he must produce this
by his own efforts, without unfair aid. The tester must learn to suppress not
only direct hints but also those unconscious acts which serve as cues to the
subject.

This is especially a problem in individual testing, where each question is
given orally. On a mental-test item where the child is supposed to receive
only one trial, his answer may show that he did not comprehend the
question. The tester will often be tempted to repeat the question “since the child
could certainly have answered correctly if he had understood what was
wanted”; this must not be done, since the test directions permit only one
trial. Adjustments are sometimes warranted, however; for example, the
result might be discarded (rather than scored as wrong) if an outside
disturbance caused the child’s failure.

Unintended help can be given by facial expression or words of
encouragement. The person taking a test is always concerned to know how well he is
doing, and watches the examiner for indications of his success. Suppose he is
given the task: “Repeat backward: 2-7-5-1-4.” He may begin “4-1-7 . . .”;
if the examiner, on hearing the “7,” permits his facial expression to change,
the subject may take the hint and catch his own mistake. The examiner must
maintain a completely unrevealing expression, while at the same time
silently assuring the subject of his interest in what he says.

Maintaining rapport is necessary if the subject is to do well. That is, the
subject must feel that he wants to cooperate with the tester. A teacher who
knows and likes a child, or a counselor who has worked with an adult, can
often secure more spontaneous and representative performance than a
stranger called in to administer tests. Those who are acquainted with the subject,
however, will be less impartial and must be unusually circumspect in
following procedures. No rules can be given for the establishment of rapport, but
the tester who likes people will develop many techniques. The person who
proceeds coldly and “scientifically” to administer the test, without convincing
the subject that he regards him as an important human being, will
frequently find it difficult to maintain cooperation. Poor rapport is evidenced
by inattention during directions, giving up before time is called,
restlessness, or finding fault with the test.

This chapter gives a general introduction to test administration. It cannot,
of course, make the reader into a skilled tester; that comes only with practice.
To clarify our discussion in this and the next three chapters, we digress here
to describe the Bennett Test of Mechanical Comprehension and the Block
Design test in some detail. These tests are important in themselves, but we
present them here so that we can refer to them to illustrate general principles
of testing. If possible, the reader should take each of these tests himself.

TWO SPECIMEN TESTS

Test of Mechanical Comprehension 

The Test of Mechanical Comprehension (TMC) originated by George K.
Bennett is one of the most widely used tests in the “special ability” group.
The first form appeared in 1940. Four forms were published under Bennett’s
name; other versions have been included in military classification batteries
and vocational aptitude tests, and a recent version is contained in the
important DAT battery for high-school guidance. A list of the forms and their
purposes was given in the catalog excerpt in Chapter 1 (p. 14).

The test manual (Bennett, 1947) begins with this description of purpose:¹

The Test of Mechanical Comprehension measures the ability to perceive and
understand the relationship of physical forces and mechanical elements in practical
situations. This type of aptitude is important for a wide variety of jobs and for
engineering and many trade school courses.

Mechanical comprehension may be regarded as one aspect of intelligence if
intelligence is broadly defined. The person who scores high in this trait tends to
learn readily the principles of operation and repair of complex devices. Like other
aptitude tests, it is influenced by environmental factors, but not to an extent that
introduces important difficulties in interpretation. Formal training in physics
appears to increase the score by not more than 4 points. Care has been taken to
present items in terms of simple, frequently encountered mechanisms that do not
resemble textbook illustrations or require special knowledge.

The test booklet carries instructions to the student and draws his attention
to two specimen items (Figure 4). The manual carries the following
directions to the tester:

This test has no time limit. Ordinarily, a great majority complete the test in
twenty to twenty-five minutes; little is gained by allowing more than thirty
minutes.

After distributing the booklets and answer sheets, say: You have been given a
test booklet containing questions and a separate sheet for your answers. Be sure to
write on only the answer sheet. Make no marks on the booklet itself.

Now look at the directions printed on the cover of your test booklet while I read
them aloud to you.

Fill in the requested information on your ANSWER SHEET [e.g., name, age,
sex, date, last grade completed].

Now line up your answer sheet with the test booklet so that the “Page 1”
arrow on the booklet meets the “Page 1” arrow on the answer sheet. Demonstrate.
Then look at Sample X on this page. It shows pictures of two rooms and asks,
“Which room has more of an echo?” Because it has neither rugs nor curtains, there

¹ Directions, norms, and specimen items reproduced by permission. Items copyright
1941, 1947, by The Psychological Corporation.

FIG. 4. Mechanical Comprehension Test items. (Sample X is from Form AA of the Bennett test, and
Sample Y from Form A of the DAT Mechanical Reasoning Test. Both items used by permission of
The Psychological Corporation. Bennett item copyright 1940, 1941, 1955; DAT item copyright 1947.)


is more of an echo in room “A,” so blacken the space under “A” on your answer
sheet. Now look at Sample Y and answer it yourself. Fill in the space under the
correct answer on your answer sheet. Are there any questions? If the answers on
the answer sheet are not directly opposite the questions, raise your hand.

After Sample Y has been answered, say: On the following pages there are more
pictures and questions. Read each question carefully, look at the picture, and fill
in the space under the best answer on the answer sheet. Make sure that your marks
are heavy and black. Erase completely any answer you wish to change. Be certain
that you use the right column on the answer sheet for each page. The arrow on the
page should meet the arrow on the answer sheet. These arrows are at a different
place on each page to help you.

Now open your booklets and fold back the cover so that only Page 2 shows,
like this. Demonstrate. Then slip your answer sheet under the booklet and line it
up so that the arrows for “Page 2” meet, like this. Demonstrate. When you finish a
page, go right on to the next. Now begin the test. Answer all the questions; you
will probably have plenty of time to finish. If you have any questions, raise your
hand. The examiner should make sure that everyone understands how to use the
answer sheet.

If the answer sheets are to be machine scored, the examiner should include
appropriate directions regarding the special pencils.

The answer sheet is similar to that illustrated on page 67.

Block Design Test 

The history of the Block Design test illustrates the way in which tests
develop. S. C. Kohs was a clinical psychologist who invented the procedure and
made up a set of items (Kohs, 1923). It was only one among a huge number
of mental tests invented during the 1920’s, when applied psychology first
came into prominence. As schools began to hire psychologists to examine
children, a demand arose for well-standardized collections of tests. A
psychologist acting as editor collected tests by various authors, improved the
directions, materials, and scoring procedures, and applied the whole set to a
large group of typical pupils to obtain standards of comparison. Several such
collections were made, including those of Grace Arthur and Pintner and
Paterson, each being designed to fill slightly different needs. The Block
Design procedure was used in many of these collections, being a good
nonverbal measure of analytic and synthetic reasoning with a wide range of
difficulty. Revision and restandardization has continued down to the present
day. Each modification alters the number of items or the directions, or
introduces new designs. We shall describe the test in the recent WAIS version
(Wechsler, 1955, p. 47).

The test makes use of a set of colored one-inch cubes originally sold for
children’s play. The test instructions begin as follows:

Start with Design 1 for all subjects. Take four blocks and say: You see these
blocks. They are all alike. On some sides they are all red; on some, all white; and
on some, half red and half white. Turn the blocks to show the different sides.
Then say: I am going to put them together to make a design. Watch me. Arrange
the four blocks slowly into the design shown on Card 1, without exposing Card 1
to the subject. Then, leaving the model intact, give four other blocks to the subject
and say: Now make one just like this. If the subject successfully completes the
design within the time limit, score 4 points and proceed to Design 2.

If the subject fails to complete the design within the time limit or arranges the
blocks incorrectly, pick up his blocks, leaving the examiner’s model intact, and say:
Watch me again. Demonstrate a second time using subject’s blocks; then mix them
up, still leaving the examiner’s model intact, and say: Now you try it and be sure to
make it just like mine. Whether subject succeeds or fails on this trial, proceed to
Design 2.

Occasionally a subject will try to duplicate the examiner’s model exactly,
including the sides. When this occurs on Design 1, tell the subject that only the top
needs to be duplicated.

A second sample is then administered in a similar manner. The test proper
begins with Design 3. The directions are:


Designs 3-10. Place the card for Design 3 before the subject and provide him
with four blocks. Say: Now make one like this. Tell me when you have finished.
When the subject indicates he has finished or at the end of the time limit, mix up
his blocks and present Design 4 with the remark: Now make one like this. Go



FIG. 5. Block Design test materials: blocks and pattern. (Pattern copyright 1940,
© 1955 by The Psychological Corporation. Reproduced by permission.)

ahead, let me know when you have finished. Follow this procedure for all
succeeding designs.

When Design 7 is reached, take out the five other blocks and say: Now make one
like this, using nine blocks. Be sure to tell me when you have finished. For Design
10 [which has an irregular outline], do not permit the subject to rotate the card to
give the design a flat base. However, give full credit if his reproduction of the
design is rotated not more than 45°.

Time Limits.
Designs 1-2: 60 seconds (time each trial separately)
Designs 3-6: 60 seconds
Designs 7-10: 120 seconds

Record time taken for the subject to complete each design if it is done correctly
within the time limit; bonuses are given for rapid performances on Designs 7-10.

Discontinue: After 3 consecutive failures. Failure on both trials of either
Design 1 or Design 2 is considered one failure.
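
The time limits and the discontinue rule quoted above can be read as a small procedure. The following Python sketch is offered only as an illustration of that reading, not as part of the Wechsler manual. The names `TIME_LIMITS`, `design_passed`, and `designs_administered` are hypothetical, and the point values and speed bonuses are left out because the excerpt does not give them in full.

```python
# Time limits in seconds for Designs 1-10, as listed in the quoted directions.
TIME_LIMITS = {design: (60 if design <= 6 else 120) for design in range(1, 11)}
MAX_CONSECUTIVE_FAILURES = 3  # "Discontinue: After 3 consecutive failures."


def design_passed(design, trials):
    """Return True if any allowed trial was correct and within the time limit.

    `trials` is a list of (correct, seconds) pairs. Designs 1 and 2 allow two
    trials, each timed separately; every later design allows one trial.
    """
    allowed = 2 if design <= 2 else 1
    limit = TIME_LIMITS[design]
    return any(correct and seconds <= limit for correct, seconds in trials[:allowed])


def designs_administered(record):
    """List the designs given before the discontinue rule stops the test.

    `record` maps design number to its list of trials. Failing both trials of
    Design 1 or Design 2 counts as a single failure, as the directions state.
    """
    administered = []
    failures_in_a_row = 0
    for design in range(1, 11):
        if design not in record:
            break
        administered.append(design)
        if design_passed(design, record[design]):
            failures_in_a_row = 0
        else:
            failures_in_a_row += 1
            if failures_in_a_row == MAX_CONSECUTIVE_FAILURES:
                break
    return administered
```

In this sketch a design completed correctly but after the time limit is treated as a failure, which follows the instruction to credit only work "done correctly within the time limit." The sketch covers only the mechanical rules; the observations the tester makes during the task are outside its scope.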

As in other individual tests, the tester observes the subject’s performance
with care. He notes the time required to complete each task, and any errors.
In addition, he watches for any revealing remark, any emotional reaction or
blocking, and any unusual method of attacking the task. Some persons are
cautious and some are impulsive. Some deal with the pattern as a whole and
some must consider each tiny section in turn. Some give up when they face
difficulty, some become erratic and make the same error repeatedly, and
others show increased interest under the greater challenge.

1. If the TMC were administered individually, could profitable observations be
made?

2. Can you think of any questions a subject might ask which the TMC directions do
not cover?

3. Wechsler's directions specify only that the blocks are to be "mixed up" after
each trial. Could this procedure be standardized more exactly? Should it?

4. Wechsler prescribes that each sample should be demonstrated only twice. Even
if the subject is unable to do the task on the second sample, the tester proceeds
to the next design. Is this a wise procedure?

5. The manual is not regarded as sufficient to prepare one to give the Wechsler
test. The tester learns by observing an experienced tester and discussing
procedures with him. What do you think you could learn about giving the Block
Design test that the manual did not tell you?

PROCEDURE FOR TEST ADMINISTRATION 
Conditions of Testing 

Certain general problems of administration are common to all tests. The
first of these is the physical situation where the test is given. If ventilation
and lighting are poor, subjects will be handicapped. On speed tests
particularly, their scores will be lower than they deserve if they do not have a
convenient place to write, including sufficient space to spread out materials.
Subjects must be placed so that they can hear directions and see
demonstrations clearly. Very large rooms are generally bad for group testing, unless
proctors are stationed to watch subjects closely. The large room has the
disadvantage that a person may hesitate to ask a question about unclear
directions which he would raise before a smaller audience. This may be solved by
having him raise his hand so that a proctor will come to his seat and
answer his question.

The state of the person tested affects the results. If the test is given when
he is fatigued, when his mind is on other problems, or when he is emotionally
disturbed, results will not be a fair sample of his behavior. Occasionally it is
necessary to test a person at an unfavorable time, as when psychological
examinations must be given to a criminal at the time of his trial. Tests to be
used in classification and guidance of college freshmen are frequently given
in the midst of a hectic week of orientation, college activities, establishment
of new friends and living arrangements, and adjustment to a semiadult
world. Sometimes a freshman who later proves to be normally intelligent
does very badly on placement tests because of homesickness, distraction,
emotional exhaustion, or unidentified causes. While tests given under these
conditions do have predictive value for most of the group, some individual
scores are misleading. If a test must be given at a psychologically
inopportune time, the only correct procedure is to maintain an adequately critical
attitude toward results. Conditions can often be improved by spacing tests
to avoid cumulative fatigue, providing for adequate rest on the night before
tests, and administering the program with a minimum of bustle and
confusion.

During World War II, the General Classification Test was often given to
soldiers just after induction when they lacked sleep, were recovering from
a farewell party, or felt ill from inoculations. In one study men who took a
second form of the test after becoming stabilized in Army routines raised
their scores 11.25 points on the average. This is a large enough shift to raise
a man from the category of potential noncom to that of potential officer
(Duncan, 1947).

Time of day may affect scores, but it is rarely important. Alert subjects
are more likely to do their best than subjects who are tired and dispirited.
Equally good results can be produced at any hour, however, if the subjects
want to do well. In most instances, fatigue apparently affects motivation
rather than the ability one can summon up. The most thorough examination
of hour-by-hour variation was conducted by Air Force psychologists
(Melton, 1947, pp. 49-51). In one study of 2500 cadets being classified at Buckley
Field, Colorado, they found striking and significant differences in
psychomotor test performance (finger dexterity, rudder control, discrimination
reaction time, etc.). In general, performance was at its peak between 10 a.m.
and 3 p.m. In an attempt to confirm and interpret this difference with
further tests of nearly 9000 cadets at other places, negligible differences were
found. The inconsistency has not been explained, but it appears that under
most operating conditions fluctuations during the day can be avoided. The
experience at Buckley Field warns the tester never to close his mind to the
possibility that error may enter his tests from unexpected sources.

Control of the Group

Group tests are given only to reasonably mature and cooperative subjects
who expect to do as the tester requests. Group testing, then, is essentially a
problem in command. For efficient testing, subjects must follow instructions
promptly and all must do the same thing. This attitude must be maintained
without interfering with the opportunity of individuals to ask questions. One
person should be in charge, standing in front of the group where he can see
all members. He will find helpful the adage “Never give an order unless you
expect it to be obeyed.” False starts, preliminary attempts to call the group
to order while late-comers are finding seats, and ineffectual tapping for
attention make it more difficult to secure real conformity when work begins.
The tester should have full attention when he starts to talk, so that
repetition will not be necessary.

Directions should be given simply, clearly, and singly. A complex
instruction, “Take your booklet, turn it face down, and then write your name on
the answer sheet,” will lead to misunderstanding and confusion. It is better
to break the instruction into unmistakable simple units: “Take your booklet.”
(Hold a sample up, and watch the group to be sure everyone has taken his
booklet before proceeding.) “Turn it face down.” (Demonstrate and wait
until everyone has complied.) “Now take your answer sheet.” (Exhibit a
sample, and wait for compliance.) “Write your name on the blank at the top,
last name first.” The subjects have a chance to ask questions whenever they
are necessary, but the examiner attempts to anticipate all reasonable
questions by full directions.

Military techniques are effective for control of a group. When a military
manner is assumed, however, it may enhance the “inhuman” character of the
test situation and give some people the feeling that the examiner is not
interested in their welfare. Effective control may be combined with good
rapport if the examiner is friendly, avoids an antagonistic, overbearing, or
faultfinding attitude, and is informal when formal control is not called for. After
establishing control, for example, he may often relax his “command manner”
and make informal comments about the test and its purpose; this does not
interfere with his resuming formal control for the test proper.

Emergencies arise which prevent uniform testing of all persons.
Occasionally, for example, a person becomes ill during the test and must leave the
room. Usually it will be possible to collect his materials, indicate that the test
is invalidated, and provide for a make-up on another occasion, perhaps with
a different form of the test. The goal of the tester is to obtain useful
information about people. There is no value in adhering rigidly to a testing schedule
if that schedule will not give true information. Common sense is the only safe
guide in the exceptional situation.

6. An employment office gives all applicants an intelligence test when their
applications are filed. One man takes the test, together with several friends, and
the group leave together. Ten minutes later he returns, greatly agitated. "Was I
supposed to turn over the last page? I thought I had finished when I got to the
bottom of page 9, so I looked back over my answers. I had plenty of time, and
I'm sure I could have done well on the last page; my friends say the questions
there were easy." What should be done in this case, if at the bottom of page 9
the booklet carried the printed statement "Go on to the next page"?

7. In testing a group of college freshmen to obtain information for use in guidance,
the examiner finds that a student newly arrived from Latin America is having
great difficulty following directions because of unfamiliarity with English. The
student asks many questions, requests repetitions, and seems unable to
comprehend what is desired. What should the examiner do?

8. In the course of a clinical analysis of a preschool child who is believed to be
poorly adjusted, a report on a series of tests is requested. The psychometrist
who gives the tests finds the child is negativistic. After cooperating reluctantly
on two tests he becomes inattentive and careless on the third. Assuming that the
test results are needed as soon as possible, what should the tester do?

Directions to the Subject

The most important responsibility of the test administrator is giving
directions. The purpose of standardized tests is to obtain measurements which
may be compared with measurements made at other times; it is therefore
imperative that the tester give the directions exactly as provided in the
manual. If the tester understands the importance of this responsibility, it is
simple to follow the printed directions, reading them word for word, adding
nothing and changing nothing.

The standard directions usually invite the subject to ask questions after
the directions have been read. In answering such questions, the tester
must not add to the ideas expressed in the standard directions, since such
supplementation might give this subject an advantage over those not having
such aid. The directions are part of the test situation; in some intelligence
and personality tests the way the subject follows directions is intended to
influence his score.

The most troublesome questions concern matters not discussed in the
standard directions. Examples are “Should we guess if we are not certain?”
“How much is taken off for a wrong answer?” “Are there any catch
questions?” “If I find a hard question, should I skip it and go on, or should I
answer every question as I go?” The published directions to the test were
evidently not adequate if they ignored these topics. When the tester refuses
to give an answer to the questions about guessing (and he must refuse if the
scores are to be compared with norms), some subjects will guess and some
will not. Therefore, while the directions are superficially standard, the
procedure becomes unstandardized because subjects interpret indefinite
instructions, each in his own way.

Attempts to test skill in flying have shown the crucial importance of
defining the task clearly for the subject. In making a check test on ability to
execute a maneuver, testers found it necessary to tell the pilot exactly how
the performance would be scored. When they omitted this, one pilot kept
his attention on maintaining altitude perfectly, whereas another of equal
ability earned a different score because he concentrated on holding the
plane’s heading steady. Tests should be provided with directions which
leave no ambiguities for variable interpretation. When the tester must use a
standard test for which directions are imperfect, he faces a difficulty for
which there is no ideal solution.

9. The Bennett TMC directions are printed in the student's test booklet, and also 
read word for word by the examiner (The only difference is the sentence in 
which the examiner asks if there are any questions) Why is it desirable to read 
the directions aloud instead of allowing the student to read to himself? 

10. How would you answer the following questions raised by students after hearing
the TMC directions?

a. Is this a speed test?

b. If I am not sure of the answer should I mark what I think is best?

11. The California Test of Mental Maturity consists of twelve sections, each
containing a different type of item. The sections are separately timed, each
requiring 3-10 minutes. Is there any reason why a high school seeking data for
guidance should not give pupils one or two sections of the test each day until
all of it is taken, rather than giving it in one or two sittings as the manual
suggests?


Judgments Left to the Examiner

While the directions will be standardized in many respects, it is unwise to
standardize the tester's procedures too rigidly. Precisely the same action or
remark by the examiner can have a different significance for different
subjects, and if so, rigid procedure itself introduces an unstandardized element
into the testing. This may be illustrated first with regard to the problem of
terminating a test.

Directions for most tests place some limit on the time allowed to solve
any problem or to work on any subtest (see the Block Design directions, for
example). To conform to the directions, the tester need only use his stop
watch attentively. Where no time limit is stated, it is still necessary to stop
the painfully conscientious subject who works long after he has done his best.

Sometimes an individual test allows credit only when a task is done in
(say) two minutes, but does not tell the examiner to stop the subject at that
time. The tester must decide whether to let the subject work after he has
passed the credit limit or to interrupt him. This is one of the situations
where the art of testing comes into play; no rules can prescribe how to
terminate an unsuccessful trial.

In an ability test like Block Design, success on one problem has an
encouraging effect during the next, but the effect of failure depends on the tester.
In the tester's eyes, the subject fails when he does not complete the task
within the time limit. If the subject is allowed to continue without
interruption, however, he may finish the task and think that he has succeeded. This
of course tends to help his subsequent performance. On the other hand,
even in extra time the subject may be unable to solve the problem. When he
appears to be getting confused and upset, it may be best to terminate the
problem and give him a fresh start. To let him continue might make him
hopelessly discouraged. In the absence of definite instructions on procedure,
the tester should observe the subject's attitudes carefully and choose
whatever course seems likely to have the best effect on his subsequent
performance.

Terman and Merrill (1937, pp. 55-58), in their advice to examiners giving
the Stanford-Binet mental test, give further illustrations of the necessity for
variation in tactics.

The tests of each year group should be given in the order in which
they appear in the manual. In order to secure the child's best
effort, however, it is sometimes necessary to change the test sequence.
For example, if the child shows resistance toward a certain type of test,
such as repeating digits, drawing, etc., it is better to shift temporarily to
a more agreeable task. When the subject is at his ease again, it is
usually possible to return to the troublesome tests with better success. Such
difficulties are particularly likely to be encountered in the testing of
pre-school children. This group presents so many special problems that we
have felt it necessary to give a separate discussion of the techniques of
pre-school testing.

The examiner's first task is to win the confidence of the child and to
overcome any timidity he may feel in the presence of a stranger. Unless
rapport has first been established, the results of the first tests are likely
to be misleading. The time and effort necessary for accomplishing this
are variable factors, depending upon the personality of both the
examiner and the subject. It is impossible to give specific rules for the
guidance of the examiner in establishing rapport. The address which flatters
and pleases one child may excite disgust in another. The examiner must
himself be genuinely interested and friendly or no amount of skilled
technique will enable him to establish a sympathetic, understanding
relationship with children. There are people who lack personal adaptability
to an extent that makes success in this field for them impossible.
Such a person has no place in a psychological clinic.

Nothing contributes more to satisfactory rapport than keeping the
child encouraged. This can be done in many subtle, friendly ways: by an
understanding smile, a spontaneous exclamation of pleasure, an
appreciative comment, or just the air of quiet understanding between equals
that carries assurance and appreciation. Any stereotyped comment
following each test becomes perfunctory and serves no purpose other than
to punctuate the tests. In general it is wise to praise frequently and
generously, but if this is done in too lavish and stilted a fashion it is likely to
defeat its purpose. The examiner should remember that he is giving
approval primarily for effort rather than for success on a particular
response. To praise only the successful responses may influence effort in
the succeeding tests. Praise should never be given between the items of
a particular test, but should be reserved until the end of that test [i.e.,
subtest]. Under no circumstances should the examiner permit himself
to show dissatisfaction with a response, however absurd it may be. With
younger children, especially, praise should not be limited to tests on
which the child has done well. Young children are characteristically
uncritical and are often enormously pleased with very inferior responses.
In praising poor performances of older subjects, the examiner should
remember that the purpose of commendation is to insure confidence and
not to reconcile the subject to an inferior level of response. In the case
of a failure that is embarrassingly evident to the child himself, the
examiner will do well to make some excuse for it. Expressions of
commendation should be varied and should fit naturally into the
conversation.

Although the examiner should always encourage the child to believe
that he can answer correctly if he will only try, he must avoid the
common practice of dragging out responses by too much urging and
cross-questioning. To do so often robs the response of significance and is likely
to interfere with the maintenance of rapport. While the examiner must
be on his guard against mistaking exceptional timidity for inability to
respond, he must also be able to recognize the silence of incapacity or
the genuineness of an “I don't know.”

The competent examiner must possess in a high degree judgment,
intelligence, sensitivity to the reactions of others, and penetration, as
well as knowledge of and regard for scientific methods and experience
in the use of psychometric techniques. No degree of mechanical
perfection of the tests themselves can ever take the place of good judgment
and psychological insight of the examiner.

12. The Bennett directions are vague as regards timing: “Little is gained by
allowing more than thirty minutes.” The DAT form of the test is definite: “At the end
of 30 minutes, say ‘Stop!’” What are the advantages and disadvantages of
the two procedures?


Guessing 

At the start of an objective ability test, some subject is likely to ask,
“Should I guess if I am not certain?” Sometimes the test directions include
an answer to this question, but even where such advice is given, some
ambiguity remains. It is against the rules for the tester to give supplementary
advice; he must retreat to such a formula as “Use your own judgment.” The
discussion which follows is intended to clarify the guessing problem for the
tester but should not influence his procedure in giving tests.

To simplify the discussion, we can speak as if items fall into two
categories: those for which the subject knows the answer, and those for which he
does not know it. If the item calls for a choice of alternatives, the subject has
a chance of picking the correct response even on the items he does not know.
If there are two alternatives, as in true-false items, he will succeed by chance
alone on 50 percent of his guesses. In scoring a two-choice test, we assume
that any wrong choice represents an unlucky guess, and that the number of
lucky guesses is equal to the number of wrong guesses. The final score on a
true-false test is counted as “number of items right minus number marked
wrong,” i.e., total number of items marked correctly less the number thought
to have been marked correctly by guessing. If there are $n$ choices per item,
the chance probability of a correct guess is $\frac{1}{n}$ and that of an unlucky
guess is $\frac{n-1}{n}$. For every $n - 1$ incorrect guesses, we expect 1 correct
guess. Hence the scoring formula most often used is “Right minus
$\frac{\text{Wrong}}{n-1}$.” Other scoring formulas have been developed, some of
which are probably superior to this one, but none of them is much used. In a
test with a liberal time allowance and comparatively easy items, subjects
usually mark every item. When that happens, the rank order of the scores
remains the same whether the score used is “number right” or
$R - \frac{W}{n-1}$.
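
The scoring rule just described can be written out as a short computation. The sketch below is illustrative only: the function name `corrected_score`, the examinee labels, and the sample figures are assumptions introduced here, not taken from any test manual. It applies “Right minus Wrong/(n - 1)” and checks, for one hypothetical 40-item five-choice test on which every item is marked, that the corrected scores rank examinees in the same order as “number right.”

```python
from fractions import Fraction


def corrected_score(right, wrong, n_choices):
    """Return R - W/(n - 1); omitted items neither add nor subtract."""
    if n_choices < 2:
        raise ValueError("an item needs at least two alternatives")
    return Fraction(right) - Fraction(wrong, n_choices - 1)


# Hypothetical examinee: 30 right, 8 wrong, 2 omitted on five-choice items.
example = corrected_score(30, 8, 5)  # 30 - 8/4

# Hypothetical 40-item test on which everyone marks every item,
# so wrong = 40 - right for each examinee.
N_ITEMS = 40
number_right = {"P": 35, "Q": 28, "R": 31, "S": 19}
corrected = {name: corrected_score(r, N_ITEMS - r, 5)
             for name, r in number_right.items()}

rank_by_right = sorted(number_right, key=number_right.get, reverse=True)
rank_by_corrected = sorted(corrected, key=corrected.get, reverse=True)
```

When every item is marked, W = N - R, so the corrected score is a rising straight-line function of R; that is why the two rankings agree in this sketch.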

A correction formula is desired because some people guess more freely
than others. The guessers would mark many right answers by chance alone.
The scoring formulas attempt to wipe out gains due to guessing.
Unfortunately, the basic logic described above does not describe the situation fairly,
and the formula does not truly “correct for guessing.” The basic assumption
is incorrect. You cannot divide items into those the subject knows perfectly
and those he does not know at all. There are items he knows fairly well but
is not positive of, and other items where he has hazy knowledge. “Guessing”
is not a matter of pure chance. Even on the items he knows least about, the
guesser's experience and common sense should permit him to choose
correctly more often than he would if he selected answers by rolling dice. A
person who guesses intelligently on ten five-choice items can expect to get
perhaps four items right, instead of the two items expected from chance
guessing. By formula, four right answers (with six wrong) would give him a
score of 2½ points, that is, 4 - 6/4. Since a person who does not guess
receives a score of zero on the same 10 items, the score is raised by
willingness to gamble.
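
The arithmetic in this example can be checked directly. The following sketch assumes, purely for illustration, that every guess has the same probability `p_correct` of being right; the function name and its parameters are hypothetical. It gives the corrected score a subject can expect from guessing on items he would otherwise omit.

```python
from fractions import Fraction


def expected_corrected_gain(n_items, p_correct, n_choices):
    """Expected R - W/(n - 1) from guessing on n_items doubtful items.

    Assumes every guess is right with the same probability p_correct.
    Omitting the same items would contribute exactly zero.
    """
    p = Fraction(p_correct)
    expected_right = n_items * p
    expected_wrong = n_items * (1 - p)
    return expected_right - expected_wrong / (n_choices - 1)


# Blind guessing on ten five-choice items: 2 right, 8 wrong expected.
blind = expected_corrected_gain(10, Fraction(1, 5), 5)

# Informed guessing as in the text: about 4 right, 6 wrong expected.
informed = expected_corrected_gain(10, Fraction(2, 5), 5)
```

Under these assumptions the formula breaks even only for pure chance guessing (a probability of 1/n per guess); any partial knowledge makes the expected gain positive, which is the point of the passage.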

The subject decides what risk he is willing to run. Some people mark only
the items they are very sure of. Others mark any item they think they
understand, and still others mark absolutely every item. This difference in
tendency to gamble is not eliminated by any change of directions or penalties.
As the penalty becomes more severe, guessing diminishes, but the bold still
take more chances than the timid (Swineford, 1941; Torrance and Ziller,
1957).

Even if the standard “correction for chance” is used, the person who
gambles on every doubtful item is likely to gain more than he loses. The only
exception is where the test constructor is so skillful in writing misleading
alternatives that the guesser is likely to pick them in preference to the right
answer. Therefore the person taking a test is usually wise to guess freely.
(But remember that the tester is not to give his group an advantage by
telling them this trade secret!)

From the point of view of the tester, tendency to guess is an
unstandardized aspect of the situation which interferes with accurate measurement.
Most European group tests remove the opportunity for blind guessing by
presenting items in “open end” form, where the subject must write or draw
the answer. American group tests, however, are almost always in
multiple-choice form because these items require less time and can be scored
mechanically.

The systematic advantage of the guesser is eliminated if the test manual
directs everyone to guess, but guessing introduces large chance errors.
Statistical comparison of “do not guess” instructions and “do guess” instructions
shows that with “do not guess” instructions the tests have slightly greater
predictive power (Greene, 1952, pp. 73-75; see also Lindquist, 1951, pp.
347ff.). Chance errors multiply when everyone guesses, and their cumulative
influence on accuracy of measurement outweighs the advantage of “do
not guess” instructions. The most widely accepted practice now is to
educate students that wild guessing is to their disadvantage, but to encourage
them to respond when they can make an informed judgment as to the most
reasonable answer even if they are uncertain. The following advice given
to College Board applicants is much fairer than strict instructions not to
guess.

When the test is scored, a percentage of the wrong answers is
subtracted from the number of right answers as a correction for haphazard
guessing. It is improbable, therefore, that mere guessing will improve
your score significantly; it may even lower your score. If, however, you
are not sure of the correct answer but have some knowledge of the
question and are able to eliminate one or more of the answer choices as
wrong, your chance of getting the right answer is improved, and it will
be to your advantage to answer such a question.

13. Bennett Form AA items (with very few exceptions) have two alternatives. The
test is scored R - ½W, for no stated reason. What effect will this formula have
as compared to R - W? Whom does it favor?

14. In the difficult Form CC of the TMC, items have five alternatives. What is the
corresponding scoring formula?

15. When scores are “corrected for guessing,” some person may receive a
negative score. What does this mean? Is he less able than a person scoring zero?

16. Compute scores for each of the following persons by the usual correction
formula.

Test 1, true-false      A has 20 right, 6 wrong, [?] omitted.
                        B has 22 right, 8 wrong, 3 omitted.

Test 2, three-choice    C has 15 right, 6 wrong, 4 omitted.
                        D has 18 right, 3 wrong, 4 omitted.

Test 3, five-choice     E has 20 right, 6 wrong, 9 omitted.
                        F has 6 right, 6 wrong, 23 omitted.

17. Give a difficult five-choice test (untimed) to a friend with instructions to answer
items only when fairly certain of the correct answer. When he has finished the
test, provide a pencil of another color and direct him to answer all the
remaining items, making the best guess he can. Determine his raw score on each
trial with and without correction for chance. If the test manual included “do
not guess” directions, how much would he gain or lose by guessing despite
the directions?

18. If you were taking a five-choice test of professional knowledge in psychology
as a requirement for a diploma, would you mark items of which you were
uncertain? Assume that the test is scored R - ¼W.

19. In a time-limit test of mental ability using multiple-choice items, how rapidly 
should the subject work, in view of the fact that higher speed leads to more 
errors? 

20. Some instructors advocate scoring achievement tests by formulas which penal¬ 
ize guessing very heavily, such as "Number right minus twice number wrong." 
What effect would this have on validity of measurement? (Cronbach, 1941; 
Etoxinod, 1940). 

21. Should test directions tell what scoring formula will be used? 

MOTIVATION FOR TAKING A TEST

In making a physical measurement (for instance, weighing a truckload of
wheat) there is no problem of motivation. Even in weighing a person,
when we put him on the scale we get a rather good measure no matter how
he feels about the operation. But in a psychological test the subject must
place himself on the scale, and unless he cares about the result he cannot be
measured.


Incentives That Raise Scores

In an ability test, our problem is like that of the industrial manager who
wants a high rate of production. Effort and productivity depend on the
reward the person foresees. The most direct reward for good test performance
is being hired for a job or being given a desirable assignment. Equally
powerful and more universally available as a source of motivation is “ego
involvement,” that is, the desire to maintain self-respect and the respect of others.
Effort is stimulated also by sheer interest in the task and by the habit of
conforming to authority.

The test score is not readily altered by simple incentives. There have
been many attempts to raise test performance by prizes, pep talks, and
monetary payments for increases in score. Almost invariably, such attempts
fail to produce appreciable improvement on ability tests over the scores
earned under the regular conditions of administration. (See, for example,
Benton, 1936; Ferguson, 1937.) These incentive studies generally offered
rewards for compliance with the demand of an authority who wanted the
test scores. When the motivational pattern is shifted to arouse the subject's
own concern over his test score, we find that scores can be improved.
Flanagan (1955), in a study of a large number of aviation cadets and high-school
students, compared evidence of careless and unmotivated performance
under various conditions. His evidence was obtained by counting, first, the
number of cadets who used stereotyped patterns of marking the answer
sheet, such as the sequence A B A B A B, and second, the number earning
chance scores on easy items. Even though the tests affected the cadets' duty
assignments, a few of these completely meaningless responses were found.
On a memory test, which was fairly typical of the entire series of tests, two
cadets per thousand showed stereotyped patterns, and five obtained chance
scores. High-school students were given a similar test with no particular
incentive, merely being told that research data were being collected. Here
there were four stereotyped-response papers per thousand, and 21 chance
scores. But in another school where the students expected that they would
receive a full report on the tests together with counseling, there were no
stereotyped responses, and only three chance patterns per thousand. It is
evident that research employing tests has little meaning unless the subject
is given a personal reason for taking the test. If he is merely asked to
cooperate in an experiment, his responses may be casual or even careless.

Motivation to do a task well or to make a good impression on an adult is
learned. During his early years, the child develops attitudes toward himself
and toward task performance which have a profound influence on his
response to tests and to school assignments. The typical middle-class child
learns to work hard because he obtains praise, tangible rewards, and special
opportunities when he achieves well. The lower-class child very often learns
to take assignments less seriously and to work barely enough to keep out of
trouble. His self-respect depends most on his relations with his classmates
outside of school, and relatively little on obtaining approval from adult
authorities (Eells et al., 1951).

Motives That Reduce Scores

The subject may frankly wish to do poorly on a test (see Pollaczek, 1952).
There are times when pupils try to limit their scores on mental tests because
a school has a classification plan in which, it is rumored, the better students
will be required to do extra work. If, in military classification, men suspect
that passing certain tests qualifies one for an unpopular assignment, there is
a temptation to fail deliberately. Another instance is the boy who
deliberately failed his school subjects so that instead of being promoted he would
be kept in the grade where his less intelligent friends were to remain.

When the subject wishes to earn the best score he can, his very desire to
do well may interfere with good performance. When one is tense, he
commits errors that he would readily detect as such otherwise. In psychomotor
tests, tension leads to poor coordination and erratic movements. In a verbal
test, the subject who fears criticism of his answers may attempt to escape it
by being overcritical of himself. In clinical mental testing, anxious patients
frequently find fault with their own answers or elaborate them to include
all possible variations and qualifications. In doing this they may spoil an
answer that would have received credit. Anxiety over tests is generated at an
early age by the attitudes of teachers, parents, and other children. Sarason
and his associates developed a special questionnaire to measure “test
anxiety,” using such items as “When the teacher says she is going to give the
class a test, do you get a nervous (or funny) feeling?” (Sarason et al., 1958).
Substantial individual differences were found which are fairly stable, as
shown by retests at a later date. The median elementary-school child admits
to 12 anxiety symptoms on the list of 43 covered by the test. Anxiety
increases gradually through the school years. It is especially interesting to note
that the anxiety scores have only very slight negative correlations with
ability. Evidently, test anxiety is about as common among the very able pupils
as among the dull ones.

The detrimental effects of anxiety may be increased by the very tactics
the tester uses to elicit the subject's best efforts. Sarason and his associates
(1952) used his questionnaire to identify Yale freshmen with high and low
anxiety (HA and LA groups, respectively). These groups were divided.
Half the students received “ego-involving” (EI) instructions which stressed
that these were intelligence tests and would be used to assist in interpreting
freshman entrance tests. The NEI (“not ego-involved”) group, on the
contrary, was told that the examiner was standardizing some tasks and that no
one would examine individual standings. The test was a stylus maze, and
five trials were given. Error scores are shown in Figure 6. We find that the
NEI groups had intermediate scores, there being little difference between
the HA-NEI and LA-NEI subgroups. In the low-anxiety group, EI
instructions had a small, generally helpful effect. The subjects with high anxiety
about tests, however, did much worse when threatened by the importance
of doing well than they did under emotionally neutral conditions.

FIG. 6. Maze performance under ego-involving (EI) and neutral (NEI) instructions (Sarason
et al., 1952).

That anxious and defensive reactions interfere with test efficiency is
also shown in a study of student nurses (G. Wiener, 1957). Student nurses at
the top and bottom extremes on “distrustfulness” were selected by a
questionnaire. Each student took sections of the Wechsler intelligence scale. The
Picture Completion test asks the subject what is missing in a picture (e.g.,
one eyebrow in a sketch of a face). Distrustful subjects were inclined to
deny that anything was missing when the answer did not come to them
immediately. Likewise, on a test of similarities (“How are praise and
punishment alike?”) the distrustful students were more inclined to deny that the
words were alike. The difference, though significant, was small; distrustful
students averaged 2.7 suspicious comments on the two tests compared to 0.9
for the trustful students. To measure the effect of suspiciousness on scores,
Wiener compared scores on the PC and Similarities tests with a vocabulary
score that presumably is not affected by suspiciousness. This comparison
shows that extreme suspiciousness lowers the IQ by about three or four
points. As Wiener says, “People who say, ‘There is nothing missing in that
picture!’ are responding to internal needs rather than to the testing situation.”
This is a maladaptive response, and necessarily lowers scores.

Threats are ever present in testing: a delinquent fears that his punishment
will depend on the test results, a child fears that a poor intelligence rating
will disappoint his parents and diminish their affection, a college girl fears
that failure will force her to leave her campus friends and return to the farm,
an anxious patient fears that a test will prove him insane. Fears such as these
can be listed without end. A striking example is the case of the young
reserve officer, extremely eager to serve in time of war, who failed his physical
examination twice because the importance of passing made him emotional,
and the emotion always brought his blood pressure over the acceptable
limit. A series of “reconditioning” treatments eventually made it possible for
him to take the test calmly. High blood pressure is more directly an
emotional concomitant than is poor thinking, but the disrupting physiological
responses have their mental counterparts.

Insofar as the tester can convince the subject that the tests will be used to
help him, not to harm him, the validity of scores will be increased.
Emphasis must be placed on the positive use of results. A job applicant fearful
of failing an aptitude test can be given to understand that test scores may
indicate a field where he will succeed. A patient fearing the verdict of a
diagnostic test should understand that it will point the way to a cure.

22. Mandler and Sarason (1952) comment, “It is questionable whether intelligence
test scores adequately describe the underlying abilities of individuals with a
high anxiety drive in the testing situation.” On the other hand, it can be argued
that a person who is not motivated to avoid failure will perform below his best
level. Which argument seems correct? How could you test whether anxiety
lowers or raises ability scores?

23. Hebb and Williams (1946) devised a test to measure the intelligence of rats.
The test consisted of a set of mazes to be run, success being scored if a direct 
path to the foodbox was taken. What problems of motivation would need to be 
considered in administering this test? 

24. In an “agility” test used by the British Armed Forces at one time, each man was
tested separately while his squad of perhaps twenty others watched. The task
called for running back and forth along a cross-shaped pattern, transferring
rings from one post to another.

a. What effect on score would be expected from being tested in a group
rather than without an audience?

b. What effect would be expected as a result of announcing each man's score
at the end of his trial, to be applauded if good?

c. What advantage or disadvantage would a man have who came last in the
group?

25. If, on a personality test, a person reveals something discreditable about him¬ 
self, can one suggest any reason other than a strong desire to be honest? 


Preparing the Subject for the Test 

The motivation most helpful to valid testing is a desire on the part of the
subject that the score be valid. This is not the normal competitive set where
one desires a high score whether it is true for him or not. It is a scientific
set, a desire to find out the truth even if the truth is unpalatable. Ideally, the
subject actually becomes a partner in testing himself.

Too often an autocratic approach is followed, something like “Take this
test and I shall decide what is to be done with you.” Most testers would
disclaim any intention of dictating, yet it is true that tests have most often been
used for the private information of the tester, who then bases
recommendations on them.

Cooperation between tester and subject is not an impossible goal.
Psychotherapy is based on diagnostic testing; decisions of school administrators
depend on standard tests; the employment manager must take responsibility
for hiring the best-qualified applicants. Responsibility cannot ordinarily be
transferred to the person tested, but the subject can be made a member of
the tester's team. The tester can take him into confidence as to the purpose
of the testing and portray the test as an opportunity to find out about
himself, just as the physician often tells the patient what medicine is being given
and what good results are to be expected from it. If the subject knows what
a test is measuring and why a fair measurement is to his advantage, he will
have little motive to provide an untruthful picture. Perhaps the most
“autocratic” of the current uses of testing is in industrial hiring, necessarily so,
since the goal of testing is profit to the firm. Yet the tests given in the hiring
line are to the advantage of the person tested, and it will build good will if
he knows it. The very facts regarding turnover that lead the employer to
screen applicants are facts which would reassure the worker if he knew
them. If he does well on the test, he can have confidence that he will make
good on the job. If he does badly, he is unlikely to last on the job. The failure
on the tests saves him from wasting time in a dead end; he can begin
instead to accumulate experience and seniority in another job for which he
is fitted.

The desirability of preparing the subject for the test by appropriate
advance information is increasingly recognized. It was formerly the common
practice in counseling centers to administer a test battery routinely to every
person coming in, and to use the test results as a basis for the first counseling
interview. Now counseling more often commences with one or two
interviews which help the person define his problem. The interview gives him a
more realistic understanding of what tests can do, reduces anxiety about
the test results, and helps in the choice of tests. Another type of
indoctrination is found in some of the great nation-wide testing programs like that
of the College Entrance Examination Board. Booklets have been prepared
for both the Scholastic Aptitude Test and the subject proficiency tests. The
booklet describes the test, gives advice on efficient work procedures, and
provides specimen items. This information increases the applicant's
confidence and reduces the disadvantage which an applicant inexperienced in
taking standard tests might otherwise have.

26. How could a "cooperative" point of view in testing be adopted: 

a. By a school principal who wishes to divide his eighth grade into sections on 
the basis of intelligence? 

b. By a veteran's counselor who must approve the plan of a handicapped 
veteran to go to college and prepare for dentistry? 

c. By a consulting psychologist who is asked by a social agency to diagnose 
and report on a potential delinquent? 

27. What explanation would you give the subject in each of the following cases?

a. College freshmen are to be tested to determine which ones may fail
because of reading deficiency.

b. At the end of a course in industrial relations for foremen, an examination
on judgment in grievance cases is to be given.

28. Is it ethical, in a test of emotional adjustment, to phrase directions so that the
subject believes his imaginative ability is being tested?

Coaching and Test Sophistication. Preparation may be carried to extreme
lengths. In Great Britain, a test given near the age of 11 determines what
type of secondary education a child will receive. This is a fateful decision,
opening or closing the gate to most professions and to financial and social
status. Parents, concerned to help their children, often pay private tutors
to prepare the child for the examination by special after-school lessons.
Indeed, it has been said that in some districts two-thirds of the candidates
receive such “black-market” coaching. The school system, unwilling that
these children should have an advantage, then may introduce a special
“coaching class” during the term preceding the examinations. Coaching for
the arithmetic and language tests consists chiefly of additional drill. The
third portion of the examination is a test of general mental ability. Coaching
may include study of tests used in past years, practice on reasoning problems
used in typical mental tests, and instruction on how to solve test problems
rapidly.

Preparation of this sort guarantees that the coached pupils perform at their
best, but perhaps spoils the test by giving them an improper advantage over
uncoached pupils. To evaluate any such procedure, it is necessary to consider
the distinction between intrinsic and extrinsic aspects of the test performance
(Gulliksen, 1950). The test is used to decide which pupils will
profit most from a later educational program. Any ability which aids performance
on the test and in the later instruction also may be called intrinsic,
whereas an ability useful only in the test is extrinsic to the decision being
made. Coaching which improves the performance intrinsically is fair, and
does not spoil the test. Teaching extra arithmetic gives the pupil an advantage
on the test, but this extra training presumably will also make him a better
student. Teaching him how to solve mazes, however, is beneficial only in
tests presenting maze items; it cannot help his later schoolwork.

There have been many studies to measure the effects of coaching, and the
studies differ in procedure and results. Some very large gains in score were
found in studies where subjects were initially almost completely unfamiliar
with objective speeded tests. Recent British results (Alfred Yates et al.,
1953, 1954) show what can be expected among reasonably well-educated
pupils today. Gains are measured by repeating the same test after the
experimental interval. According to these studies,

“Control” groups gain (on the average) about 2-3 points in IQ, merely
as a result of taking the first test.

“Coached” groups gain about 5-6 points, after having been told about
tests and having had numerous representative items explained by the
teacher.

“Practiced, uncoached” groups gain about 6 points, after taking from
four to eight tests without special explanation.

“Practiced and coached” groups may gain 8-10 points.

It is noted in all these studies that a very extended period of practice or
coaching is no more helpful than a few sessions. Gains such as those shown
in these studies, and in the corresponding American studies (Dear, 1958),
are relatively small. While they might make the difference between success
and failure in obtaining admission to a higher school, for a pupil near the
borderline, coaching will not raise a poor college prospect sufficiently to help
him over the examination hurdle.

29. What implications does the British investigation of coaching have for Americans
who use mental tests to select scholarship winners?

30. In Japan a young man's career opportunities depend very much on his ability
to capture one of the limited number of openings in the University. Vacancies
are filled on the basis of entrance examinations and school records. Magazines
bearing such titles as Student Days, Examiners' Circle, and Period of
Diligent Study have large circulations. These magazines deal with topics of
interest to candidates including information about typical test materials
(though the actual test questions are of course guarded). Would such magazines
increase or decrease the validity of the tests?

31. In planning a competitive mental test to be given all Japanese youth applying
for higher schools, two policies appeared possible. One was to devise new
types of test items each year, so that knowledge about previous examinations
would be of no help. The other proposal favored using the same types of
questions every year (for example, number series) but changing the items used.
Compare the plans from the point of view of the test maker, the student, and
the person interpreting the results.

32. Which of these types of preparation for a scholastic aptitude test leads to
changes in intrinsic ability?

a. Vocabulary-building exercises.

b. Advice about whether to guess when in doubt.

c. Therapeutic counseling to reduce fear of failure and feelings of inadequacy.

33. In some college residence halls, students file questions from past examinations.
From the point of view of the professor teaching the course year after year,
does this increase or decrease the validity of measurement?


Testing Procedure as Standardization of Behavior 

We may understand better the problem of framing directions and arousing
proper motivation if we realize that the psychometric tester tries to standardize
the behavior of the subject, as well as the test stimuli. Even though he
is measuring individual differences, his procedures are designed to eliminate
individual differences—to eliminate, that is, variation in every characteristic
save the one that his test is supposed to measure.

To clarify this, consider the physiological measure of basal metabolism
rate. If a doctor wants a BMR measure, he requires his patient to fast for
eight hours before the test. This eliminates differences in eating habits
which would affect oxygen utilization. For the test itself, it is necessary to
reduce the patient’s bodily activity to an absolute minimum by putting him
into bed; every patient is, in effect, reduced to a standard activity level. The
BMR, calculated from the oxygen intake and the carbon dioxide exhaled, is
a useful measure of the patient’s physiological state. This measure is taken
in an artificial “standard condition” which almost never occurs in real life.
The person’s metabolism rate as he goes about his daily affairs is not much
like his BMR, since it is affected by his eating, activity, and other variables.

Psychological tests are similarly designed to extract one variable, purified
as much as possible, from the total life activity. The psychologist is concerned
if some students fail to understand his directions because this irrelevant
difference will affect his results. He is concerned if some students receive
coaching, if some are especially anxious about the test, if some interpret the
test as a speed test while others think carefulness counts most. All these
sources of variation blur his measurement. He tries, in setting the stage for a
test, to reduce all his subjects to a “standard state” of motivation, expectation,
and interpretation of the task.

An example of such standardization is found in certain tests intended to
measure personality traits. One might evaluate these qualities by observing
behavior in everyday affairs. The meaning of this behavior is uncertain,
however, since different subjects may be trying to do quite different things. If
the situation is more definitely structured so that all subjects have the same
goal in mind, differences are more certainly attributable to personality. For
this reason, many tests of persistence, reaction to frustration, flexibility,
and other traits are disguised as measures of ability. The subject is given a
definite task and motivated just as for an ability test. He does not realize that
the tester will pay attention chiefly to how he goes about the task.

TESTING AS A SOCIAL RELATIONSHIP 

The tester has been accustomed to think of himself as an unemotional,
impartial task-setter. His traditions encourage the idea that he, like the physical
scientist or engineer, is “measuring an object” with a technical tool. But the
“object” before him is a person, and testing involves a complex psychological
relationship. The traditional concern with motivation and rapport recognizes
this fact but, as illustrated in the foregoing sections, leads to little more
than a recommendation that the tester be pleasant and encouraging, and
help the subject understand the value of the test. This, we are beginning to
suspect, barely touches the real social-psychological complexities of testing.

As Schafer (1954, p. 6) says, 

The clinical testing situation has a complex psychological structure. It
is not an impersonal getting-together of two people in order that one,
with the help of a little “rapport,” may obtain some “objective” test
responses from the other. The psychiatric patient is in some acute or
chronic life crisis. He cannot but bring many hopes, fears, assumptions,
demands and expectations into the test situation. He cannot but respond
intensely to certain real as well as fantasied attributes of that situation.
Being human and having to make a living—facts often ignored
—the tester too brings hopes, fears, assumptions, demands and expectations
into the test situation. He too responds personally and often intensely
to what goes on—in reality and in fantasy—in that situation,
however well he may conceal his personal response from the patient,
from himself, and from his colleagues.

The subject coming for an individual test almost invariably is in difficulties.
He may have been referred by some authority who demands that he
be tested; if so, the tester may be simply another authority to rebel against.
Other subjects are self-referred. One might expect cooperation in such a
case because the subject is asking for help, but he too may come with motives
which conflict with the tester’s objectives. The very fact that he has
had to seek psychological help may disturb the person who wants to be
independent. He may have doubts regarding his own adequacy which he
attempts to suppress by every available strategy. It is commonplace to discover,
behind a college student’s self-referral for remedial reading or vocational
counseling, a problem of sexual adjustment or emotional conflict with
parents. The student, by focusing his attention and that of the psychologist
on a superficial or nonexistent problem, is using an unconscious sleight-of-hand
to conceal the problems he does not want to face.

Instead of being hostile and resistant, the subject may present himself as
friendly and totally submissive. This can go far beyond the normal, mature
fact-giving which the tester wishes for. Some subjects “turn themselves over”
to the psychologist, thereby avoiding responsibility for their own problems.

None of us is willing to expose himself completely, or even to learn the
whole truth about himself, yet the job of the tester is to penetrate personal
secrets. In clinical testing and interviewing particularly, the psychologist
really tries to bring to the surface the whole personality—sexual attitudes,
feelings of inadequacy, hostilities and wishes the patient is ashamed of, and
so on. Even when the tester has a much more limited aim, the patient may
believe that his intimate desires and anxieties will be exposed by the tests.
The popular literature on psychology and psychiatry being what it is, the
subject may expect the psychologist to be almost pruriently concerned with
tabooed areas. Or he may view the tester as a modern magician from whom
no truth can be hidden and whose every judgment is beyond question.

These attitudes define a role which the tester is expected to play, and
the tester’s own self-conceptions define another. When tester and subject
meet, therefore, their mutual demands may support each other, or they may
pull in opposite directions. A client who wants to escape responsibility may
fall into the hands of a tester who likes to pose as infallible and to dominate
others. This tester is unlikely to sense that the client’s seeming passivity is
just a strategy adopted to keep the tester from probing into an unstated
problem. The situation is little better if the tester is one who, because of
self-doubt, cannot comfortably take responsibility. Pressed by the client to make
a definite recommendation, this insecure tester will retreat from responsibility.
He will pile test upon test, so that the mass of data will relieve him of the
burden of judgment. He will qualify his interpretations and obscure them in
technical jargon to frustrate the client’s unacceptable demand. Finally,
he terminates the counseling with “All tests can do is give you a basis for
making your own decision.” By this he avoids a counseling relation—longer,
more intimate, but uncomfortable—in which he could bring the client to
understand his passivity and hesitancy.

Schafer points out that the tester chooses his profession because it satisfies
his needs. The tester may be one who feels inadequate in social relations,
but who can obtain reassurance from seemingly objective instruments. He
may prefer the brief and distant contact of objective testing to the demanding
personal relations that teachers and therapists have. He may be answering
doubts about himself by comparing himself favorably at every turn with
those he tests. On the contrary, instead of having these remote and competitive
attitudes, he may be one who seeks grateful and dependent reactions
from subjects.

All these patterns can distort testing procedures and test interpretations.
The overly “objective” tester may be unwilling to give the subject the emotional
support required to reduce resistance and elicit his best performance.
He may overemphasize difficulties that can be treated unemotionally (limited
vocabulary, for example) but overlook emotional needs. The competitive
tester may be too ready to identify weaknesses, or to describe subjects
he admires as having virtues he hopefully sees in himself. (Wilson tells us
that, when he trained a group of intelligent convicts to give a performance
test to new inmates, he had to supervise constantly to prevent their making
procedural errors to reduce the subject’s score and so magnify their own
superiority.) The tester who seeks emotional support from patients may be too
lenient and encouraging, and all too willing to overlook weaknesses in the
record.

Granting that both tester and subject come to the situation with a full
complement of human motives, some of which they are not aware of, what
should the tester do about it? At this point, with research on these motives
almost entirely lacking, we can make only common-sense suggestions. The
first of these is “Know thyself.” The more the tester knows of his own
personality, of his preferences for different types of subject, and of the biases he
brings to test interpretation, the greater the chance that he can meet each
situation properly. The second suggestion is Schafer’s recommendation that
the social situation itself be considered an important way of understanding
the subject, and that his strategies, demands, and resistances themselves be
taken into account in interpreting scores. His view, which is the only thorough
statement on the problem yet attempted, is well summarized in this
paragraph (1954, pp. 72-73):

There are those who would object that this total-situation approach
violates the objectivity of test interpretation. Only in the narrow and
false sense in which objectivity has been usually conceived is this true.
The ideal of objectivity requires that we recognize as much as possible
what is going on in the situation we are studying. It requires in particular
that we remember the tester and his patient are both human and
alive and therefore inevitably interacting in the test situation. True, the
further we move away from mechanized interpretation or comparison
of formal scores and averages, the more subjective variables we may
introduce into the interpretive process. The personality and personal
limitations of the tester may be brought into the thick of the interpretive
problem. But while we thereby increase the likelihood of personalized
interpretation and variation among testers, we are at the same time in a
position to enrich our understanding and our test reports significantly.
The more data we use, after all, the greater the richness and specificity
of our analyses—and in the long run the more accurate we become.

Schafer’s view obviously demands impressionistic interpretation, and is
not fully acceptable to psychometric testers. His view need not be accepted,
since no evidence is offered that these complex interpretations can indeed
be made accurately. Those who reject Schafer’s recommendation must,
however, face the problem of interpersonal dynamics and find their own
solution. Even a strictly poker-faced administration of an individual mental test
is an hour-long stress situation, every moment of which involves emotional
interaction between tester and subject.

From a psychometric viewpoint, it has been suggested that the effect of
tester personality is merely that different testers obtain different average
results. IQs obtained by one tester average a few points higher than those
of another. Rorschachs given by tester X contain more “movement” responses
than those of tester Y. Why not, then, “calibrate” each tester as a laboratory
thermometer is calibrated, so that his errors can be compensated for? If we
know that, on the average, his Wechsler IQs average 2.3 points higher than
those of other testers, we can adjust his reports. This is not a realistic
suggestion. Calibration requires an overwhelming amount of research, and at
best such a correction deals with the average error rather than the errors
which vary from case to case. Individual tests cannot be standardized well
enough so that all testers will obtain identical results; the best hope is that
careful training of testers can remove most of their consistent errors.
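
The limitation of calibration can be seen in a small numerical sketch. The IQ figures below are invented for illustration and do not come from any study: `reported` stands for what one tester obtained, and `reference` for what a pool of other testers is assumed to have obtained for the same six subjects. Subtracting the tester's average error brings his mean into line, but the case-to-case errors are left exactly as large as before.

```python
from statistics import mean, pstdev

# Hypothetical IQs for six subjects (illustrative numbers only).
reported = [105, 99, 114, 90, 123, 101]    # obtained by the tester in question
reference = [100, 100, 110, 90, 120, 100]  # assumed result from other testers

# Error of the tester on each case, and his average error.
errors = [r - ref for r, ref in zip(reported, reference)]
average_error = mean(errors)

# "Calibration": subtract the average error from every report.
adjusted = [r - average_error for r in reported]
remaining = [a - ref for a, ref in zip(adjusted, reference)]

print("average error before adjustment:", average_error)
print("average error after adjustment: ", mean(remaining))
print("spread of errors before:", round(pstdev(errors), 2))
print("spread of errors after: ", round(pstdev(remaining), 2))
```

The adjustment shifts every report by the same constant, so it can remove only the consistent part of the error; the spread of the individual errors is unchanged.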

34. In what way could sympathy and love for children bias a tester? What parts
of the testing process would be affected by this bias?

35. If social factors and examiner differences affect individual tests more than 
group tests, does this imply that group tests are better measuring instruments? 

36. Does a formal and impersonal attitude toward all subjects standardize the 
testing relationship? 


Suggested Readings 

Biber, Barbara, & others. Stenographic record of psychological examination.
Life and ways of the seven-year-old. New York: Basic Books, 1952. Pp. 631-639.
This is a record of remarks made by both examiner and subject before and
during a series of performance tests of mental ability, including the Porteus
maze and several formboards. Note the many places where the examiner is
willing to digress from the test into other conversation in order to maintain
rapport.

Bingham, Walter V. Administration of tests, and Giving group tests. Aptitudes
and aptitude testing. New York: Harper, 1937. Pp. 224-244.
These common-sense suggestions, based on long experience, will be of
great value to beginning testers. A translation of a German checklist for
observing behavior during the test, used for an impressionistic evaluation of
performance, is included.

Schafer, Roy. Interpersonal dynamics in the test situation. Psychoanalytic
interpretation in Rorschach testing. New York: Grune & Stratton, 1954. Pp. 6-73.
In a thought-provoking discussion of the motives with which the tester and
subject approach each other, Schafer speculates regarding defenses in the
tester's personality (dependence, overintellectualization, sadism, etc.) which
may reduce his effectiveness.

Thompson, Anton. Test-giver’s self-inventory. Calif. J. educ. Res., 1950, 7, 67-71.
A checklist pointing out nearly fifty specific practices that characterize good
test administration includes numerous techniques that practical experience
shows to be advisable, which inexperienced group testers tend to overlook.

Wilson, Donald Powell. My six convicts. New York: Rinehart, 1951.
A best-seller describes the experiences of a psychologist doing research on
drug addiction in a prison, using convicts as testing assistants. Of special value
are Chapters III and IV, describing how the team overcame reluctance of
convicts to take tests. See also, on p. 235, the explanation in convict language
of “the coefficient of correlation.”

Scoring


SCORING PROCEDURES 

ANY student who has tried to understand why he received a low score on
some essay examination must realize how difficult it is to define a good
answer and to determine the proper credit for a partially correct response.
Starch and Elliott (1912, 1913) provided conclusive evidence on the faults
of impressionistic scoring as long ago as 1912. They presented a pupil’s
English composition to a convention of teachers and asked a number of
volunteers to grade it. On a percentage scale, the grades assigned ranged from 50
to 98. This evidence of disagreement could perhaps be dismissed, since
judging a composition is influenced by preferences for various styles. To drive
home their point, however, they had a geometry paper graded in the same
way. The scores ranged from 28 to 92, presumably because of variation in
the credit given to neatness, partial solutions, etc.

No scientific research on behavior can be done nor can we hope to reach
sound practical decisions if scoring standards vary erratically. One solution
is to develop rules for judgment which all scorers will follow. The other
possibility is to use recognition items where the subject is to choose the right
answer; this eliminates all judgment from scoring, once the initial key is
agreed upon by competent persons.

Scoring of Free Responses 

Individual testing continues to use problems calling for some degree of
judgment in scoring, but methods can be devised which permit fairly
objective scoring of the important features of behavior. Ayres, for example,
produced a guide for scoring pupil handwriting (Figure 7). Samples of
handwriting representing various levels of quality are given; the teacher
locates the sample most similar to the pupil’s work in order to determine his
score. Product-rating scales can be developed for judging quality of sewing,
shopwork, etc. Objective methods have not been completely successful in
scoring verbal tests, but variation among scorers is reduced by guides which
show the approved scoring for representative answers. Noteworthy examples
are the scoring manual for the Stanford-Binet test of intelligence (Terman
and Merrill, 1959) and the volume by Beck (1944) on the Rorschach test
of personality.

FIG. 7. Part of Ayres' scale for scoring handwriting samples. (Copyright 1912, Russell Sage
Foundation. Reproduced by permission of the present publisher, Educational Testing Service.)

While the scoring guide is adequate for most testing of individuals, special
precautions must be taken when free-response tests are used to compare
experimental treatments. The scorer who believes or wishes to prove that one
treatment is superior may unconsciously tend to give higher scores to the
subjects who had that treatment (Goodenough, 1940). To prevent such bias,
it is necessary to mix all records together before presenting them to a scorer
who does not know which group any person belongs to. This procedure is
called “blind” scoring.
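
The mixing step in blind scoring can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `blind_records`, the record layout (subject, group, response), and the code labels such as `R001` are all hypothetical and are not taken from any published procedure. The scorer receives shuffled responses identified only by code, and the key linking codes to subjects and groups is held back until scoring is complete.

```python
import random

def blind_records(records, seed=None):
    """Shuffle (subject, group, response) records and hide group membership.

    Returns (scorer_view, key). scorer_view holds (code, response) pairs
    for the scorer; key maps each code back to (subject, group).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)              # mix the treatment groups together
    scorer_view = []                   # all that the scorer is allowed to see
    key = {}                           # withheld until scoring is finished
    for i, (subject, group, response) in enumerate(shuffled, start=1):
        code = f"R{i:03d}"
        scorer_view.append((code, response))
        key[code] = (subject, group)
    return scorer_view, key

# Hypothetical records from two experimental groups.
records = [
    ("Adams", "treatment", "response text A"),
    ("Baker", "control", "response text B"),
    ("Clark", "treatment", "response text C"),
    ("Davis", "control", "response text D"),
]
scorer_view, key = blind_records(records, seed=7)
for code, response in scorer_view:
    print(code, response)
```

After the scorer has assigned a score to every code, the key is used to restore group membership for the comparison of treatments.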

1. The question "Why should people wash their clothing?" is to be used in an oral
intelligence test for adults, to test comprehension of common situations. Prepare
a set of standards for judging correctness of answers. Make your rules so clear
that scorers would be able to agree in scoring new answers.

Scoring of Recognition Items

The scoring guide for a recognition test consists of a list of the correct
answers and a scoring key which a clerk can use. Several efficient procedures
for obtaining and scoring responses have been devised, including the carbon
booklet, the pinprick booklet, and the separate answer sheet. One common
form of the separate answer sheet is shown in Figure 8. With the separate
answer sheet, costs are reduced because the same booklet can be used
repeatedly, and the answers can easily be scored by a punched key or by
machine. The carbon booklet consists of a face-page backed with carbon paper
and a hidden under-page printed with an answer key. The pages are
sealed together, and the subject marks his choices on the face-page. His
marks are traced by the carbon onto the bottom page. The scorer tears open
a perforated edge of the booklet to reveal the bottom page on which printed
squares show where the correct answers should appear. It is a simple matter
to count the number of carbon marks falling within the squares. The pinprick
method is similar to the carbon booklet. Instead of checking his answer
with a pencil, the subject sticks in a pin at that point. Squares are
printed on the back of the page so that when the booklet is torn open the
number of holes falling within squares indicates the score.

FIG. 8. Portion of answer sheet for machine scoring. (Courtesy International Business Machines
Corporation.)

Machine Scoring. The scoring machine most widely known is that developed
by International Business Machines in the late 1930’s. The subject
blackens an answer space with a soft pencil. Electrified “fingers” in the
machine sense where pencil marks appear, since the graphite in the marks
carries current. A meter shows the total number of properly placed marks.
The machine will report number of errors, rights-minus-wrongs, and other
types of scores. Under ideal conditions, it can score as many as 500 papers
per hour accurately (Traxler, 1954; Lindquist, 1951, pp. 408 ff.). Military
classification centers rely on the machines to process recruits. Large school
systems operate such machines to score tests for the entire system. In most
sections of the country, a test-scoring service is available where tests from
scattered schools may be machine-scored for a moderate fee.
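
The tallies the machine reports amount to a comparison of each marked space with the key. The following Python sketch shows that arithmetic for rights, wrongs, omissions, and rights-minus-wrongs. It makes no attempt to imitate the electrical sensing of the IBM machine; the function name `score_sheet`, the use of `None` for an unmarked item, and the six-item key are assumptions made for the example.

```python
def score_sheet(marks, key):
    """Tally one answer sheet against a key.

    marks: the choice marked for each item, or None where nothing is marked.
    key:   the correct choice for each item, in the same order.
    """
    if len(marks) != len(key):
        raise ValueError("answer sheet and key must cover the same items")
    rights = wrongs = omits = 0
    for given, correct in zip(marks, key):
        if given is None:
            omits += 1       # no mark in any space for this item
        elif given == correct:
            rights += 1      # properly placed mark
        else:
            wrongs += 1      # mark in a wrong space
    return {
        "rights": rights,
        "wrongs": wrongs,
        "omits": omits,
        "rights_minus_wrongs": rights - wrongs,
    }

# Hypothetical six-item key and one student's sheet.
key = ["B", "D", "A", "C", "B", "A"]
sheet = ["B", "D", "C", None, "B", "D"]
result = score_sheet(sheet, key)
print(result)
```

For this sheet the tally is 3 rights, 2 wrongs, and 1 omission, so the rights-minus-wrongs score is 1.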

The main difficulty with the IBM machine, and one which concerns test
administrators, is that it cannot score accurately unless answer spaces are
neatly blackened by the student. While the machine is supposed to deliver
one unit of credit whenever a mark appears in the right space, whether the
mark is light or heavy, wide or narrow, in practice this cannot be expected.
The mark must be made with the proper sort of soft pencil, and to be sure
of being counted must fill completely the space between the dotted lines on
the answer sheet. Furthermore, stray pencil marks and smudges due to
untidy erasure register and are counted as errors. The number of improperly
blackened or untidy papers is so great that scoring agencies have clerks
examine each paper before feeding it to the machine; if necessary, the clerk
blackens more heavily where the student used a faint mark to indicate his
answer, and erases stray marks. This adds appreciably to the cost of scoring.
The need for such re-marking can be almost entirely eliminated by proper
test administration and proctoring.

Newer electronic scoring machines are becoming available. One developed
by the University of Iowa for its high-school testing program combines
a photoelectric “reading” device with an electronic computer (Lindquist,
1954). Responses to as many as 960 items (i.e., an entire battery of tests)
can be put on a single answer sheet. The student records his name in a special
“name grid” (if his name is Jones, he blackens J in the first column,
O in the second, etc.). It is estimated that this machine is able to score 6000
answer sheets per hour and print both raw scores and converted scores on a
summary sheet. Part scores and weighted composites can be obtained, and
many computations desired for research can also be carried out at the same
time.

We are beginning to see automation in testing itself, as well as in scoring.
This is particularly demonstrated in the application of “Skinner box” techniques
to human subjects. As used in the psychological laboratory the Skinner
box has taken the form of a cage in which the rat, pigeon, or other animal
activates a mechanism by making a particular response: striking a lever,
tapping a spot on the cage wall, etc. Rewards for correct performance can
be administered automatically according to any desired schedule—for example,
by dropping a food pellet into the tray at the end of one minute if the
lever has been struck during that minute. Skinner (1958) is now adapting
the same principle to the study of arithmetic performance of children. The
child responds to problems presented by the machine, pushing a button to
indicate his answer; a correct response is rewarded by a signal. The machine
puts the child through an automatic drill, and at the end delivers a record
showing the child’s rate of response and his accuracy.

In a mental hospital, Skinner and Lindsley arranged a room where patients
receive rewards for pulling a lever (Lindsley, 1956). The rewards include
cigarettes, a brief look at an entertaining picture, or (as a social reward)
the opening of a window where they can watch one of the doctors
working at his desk. In this device also, an automatic recording machine
traces the rate of response and thus provides a performance record which
might have diagnostic significance. One of the most striking features of the
method is that it provides a completely nonverbal test. The subject can be
introduced into the test room with no instructions whatsoever and left to
discover for himself what happens when he pulls the lever and to respond
to the reward in his own manner.

2. If Skinner's procedures yield interpretable information, it will be possible to
administer the "test" automatically to mental-hospital patients. Disregarding
questions of cost, what are the advantages and disadvantages of automatic testing,
as compared with face-to-face testing by a clinical psychologist?

3. Reexamine the directions for the TMC (Chapter 3). How would you alter those
directions to make sure that students blacken answer sheets satisfactorily?

4. What effects upon the character of tests and their use might be expected to
follow from the availability of a machine which makes it possible to obtain
virtually an unlimited number of scores from a single answer sheet, at negligible
cost?

INTERPRETATION OF SCORES 
Raw Scores 

Most tests yield a direct numerical report of a person’s performance called
his raw score. This may be the number of questions he answered, the time
he required for the task, or some similar number. Because raw scores are
readily available, and familiar from long experience in classroom examinations,
many people interpret them without realizing their limitations. An
example from the old-fashioned report card will demonstrate the problem.

Willie brings home a report showing that his average in arithmetic is 75,
and his average in spelling is 90. His parents can be counted on to praise the
latter and disapprove the former. Willie might quite properly protest, “But
you should see what the other kids get in arithmetic. Lots of them get 60
and 65.” The parents, who know a good grade when they see one, refuse to
be sidetracked by such irrelevance. But what do Willie’s grades mean? It
might appear that he has mastered three-fourths of the course work in
arithmetic, and nine-tenths in spelling. But Willie objects to that, too. “I learned
all my combinations, but he doesn’t ask much about those. The tests are full
of word problems, and we only studied them a little.” Willie evidently gets
75 percent of the questions asked, but since the questions may be easy or
hard, the percentage itself is meaningless. We cannot compare Willie with
his sister Sue, whose teacher in another grade gives much easier tests so that
Sue brings home a proud 88 in arithmetic. It could be, too, that Willie’s
shining 90 in spelling is misleading, if the spelling tests deal with the very words
assigned for study.

A raw score on a psychological test, taken by itself, has no significance. It
can be interpreted only by comparing it to some standard. Stoddard’s
remarks (1943, p. 83) illustrate this point: “In American college circles, the
statement that John Smith has run 100 yards in 9 6/10 seconds reveals an
extraordinary accomplishment. On a priori grounds there is no occasion for
knowing whether a man should run 100 yards in 8 seconds or 20 seconds.
But Smith is immediately placed against the background of the thousands
of men who, having run their best, could not get under 10 seconds.”

It is especially important to realize that we cannot interpret psychological
test scores as we do physical measures. Physical measuring scales generally
have a true zero and equal units along the scale; this permits us to say, for
example, that one boy is twice as tall as another, or has attained 00 percent
of his probable adult height. We cannot make statements like this about
psychological measures. Suppose Willie had earned a score of 10 percent in
spelling. Would this mean that he knows only one-tenth of the words he
should? No, for the teacher probably did not ask about easy words that Willie
was sure to know. Even a score of zero on the test would not mean zero
ability to spell. The difference between Willie with a score of zero and the
model pupil who earns 100 is perhaps a difference in ability to spell only
twenty words out of an active vocabulary of several thousand, if those
twenty constituted the test. The same argument applies to tests of reasoning
ability. A raw score of 80 may appear to represent ability twice as great as a
raw score of 40. The test does not include the problems everyone can solve,
however; if people were tested on every possible problem calling for
reasoning, the true ratio might be 140 to 180, or 1040 to 1080. Even an infant,
looking toward the door when he hears his mother’s footstep, shows some
degree of ability to reason. Absolute zero in any ability is “just no ability at
all.”
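The arithmetic behind the "140 to 180" remark can be sketched in a few lines. This is only an illustration: the function name `true_ratio` and the idea of crediting both persons with the same number of untested easy problems are assumptions made here to mirror the text's example, not a procedure given in the book.

```python
def true_ratio(raw_a, raw_b, untested):
    """Ratio of two abilities if `untested` easy problems (hypothetical,
    solved by everyone and therefore left off the test) are credited to both."""
    return (raw_a + untested) / (raw_b + untested)

# Raw scores of 80 and 40 look like a 2-to-1 difference in ability,
# but the ratio shrinks as soon as untested problems are counted.
for untested in (0, 100, 1000):
    print(untested, round(true_ratio(80, 40, untested), 2))
```

With 100 untested problems the comparison becomes 180 to 140, and with 1000 it becomes 1080 to 1040, which is why a ratio of raw scores says little about a ratio of abilities.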

Differences in raw scores do not ordinarily represent “true” distances
between individuals. Suppose, on DAT Mechanical Reasoning Form A, Adam
gets 53 points, Bill gets 56, and Charles gets 59. The raw-score differences are
equal. Is Charles truly as different from Bill as Bill is from Adam? We cannot
be sure, since the score difference depends on the items used. Judging
from the published norm tables for twelfth-graders, if these same boys took
the Bennett Form AA the “equal differences” would be replaced by
unequal ones. Adam would get 44 points, Bill would get 45, and Charles 48.
The only way one can meaningfully talk about “equal differences” is to
bring in some practical criterion which provides a standard of value.
Different standards will lead to different numerical scales for the same test. On
the DAT the three boys' raw scores are equally spaced. Their probabilities
of passing a college engineering course may be .70, .90, and .96, respectively.
Their most likely freshman grade averages may be D, C+, and B-. And
their respective probabilities of later success in a very demanding
engineering firm may be .0001, .05, and .50. “Equal intervals” on one of these
scales are quite unequal on the other.

Having scored a test, the tester has four alternatives:

• He may compare the score directly to some accepted standard of
performance. For example, a school may admit to the first grade all children
who earn a certain predetermined score on a readiness test.

• He may compare the score to other scores in the group tested.

• He may compare the score to scores in a reference group by means of a
table of norms.

• He may use an expectancy table to estimate the individual’s probable
subsequent performance.

Of these methods, the most common is to compare the individual with
a reference group. The tester refers to a table in the manual to learn what
the normal range of performance is. More than that, he converts the raw
score into some type of derived score which is a permanent record of the
individual’s relative position. The most common types of derived score are
percentiles and standard scores.

Many statistical methods used by the test developer are simple enough to
be followed by the test user, who can prepare norm tables or expectancy
tables for his own group. Expectancy tables, which we consider first, require
no more than simple tabulation, and calculation of percentages.

5. Decide whether an absolute zero exists for each of the following variables and,
where possible, define it.

a. Height.

b. Ability to discriminate between the pitches of tones.

c. Speed of tapping.

d. Gregariousness, seeking the companionship of others.

e. Rifle aiming.

6. If several pupils in Willie’s class move away and are replaced by newcomers,
will his raw score in arithmetic probably change? His rank in his class?

7. If a different set of test questions were used in arithmetic, would Willie's raw
score change? His rank?

8. Alfred, a college freshman, is to receive guidance on his academic plans, and is
given four tests of ability. Scores are presented in four different ways. Interpret
separately each row of scores.



                              Vocabu-   Verbal      Nonverbal   Mechanical
                              lary      Reasoning   Reasoning   Comprehension

Raw score                     116       32          44          48
Percent of possible points     77       73          80          71
Points above average           24       10          20           0
Rank among 260 freshmen       104       113         161         136


9. Two runners train for the mile. One, between his junior and senior years, reduces
his time from 4:16 to 4:04. The other starts with a time of 5:16. What time must
he achieve for us to say that he has made as much improvement as the first
runner?

Expectancy Tables 

The expectancy table is a useful device for interpreting performance. The
test developer or test user administers the test to a large number of persons
and subsequently observes their success; these results can be tabulated to
form an experience table such as Table 1. This table is based on application
of a general scholastic aptitude test (the Ohio State University
Psychological Examination) to 920 freshmen at Ohio State. To interpret a student’s
score, the counselor need only direct attention to the row of the table
corresponding to the score; the entries show how likely the student is to attain any
particular grade average. This explanation is more definite and more
complete than can be offered by any other system of norms. As Bingham says
(1951, p. 552), “The counselor of an entering student who has scored in the
lowest decile range (lowest tenth) on this test can now show him these
expectancies, and point out, if it seems advisable, that his chances of keeping
off probation (Point-Hour Ratio = 1.50) are a little better than even; that he
has one chance in a hundred of earning high honors; and that in any event
much depends on the persistence and strength of his own determination, a
powerful factor not measured by this or any other psychological test.”

Expectancy data may also be presented in charts like Figure 10. The
chart gives less precise interpretations than the table but is especially useful
for explaining scores to laymen. The charts illustrated are based on three
different tests; it can be seen that the dexterity test is a much less accurate
predictor than the other two. Besides interpreting scores for individuals, the
expectancy table gives information on the validity of a test (see Chapter 5).

TABLE 1. Expectancy Table for First-Semester Freshman Achievement

Score on OSU                Probability of Earning a Point-Hour
Psychological Test          Ratio of at Least

Raw        Percentile       1.00      1.50          2.00      2.50    3.00
Score      Rank             (D av.)   (Probation)   (C av.)           (B av.)

114-150    90-              100       99            93        80      56
102-113    80-89            100       96            91        60      30
 92-101    70-79            100       95            90        60      29
 83-91     60-69             99       90            78        41      27
 75-82     50-59             98       87            74        25      13
 66-74     40-49             97       80            62        25      13
 56-65     30-39             96       79            61        17       5
 48-55     20-29             95       75            47        13       4
 39-47     10-19             95       63            33         7       2
  0-39      -9               87       58            29         3       1

Source: Bingham, 1951, based on data from G. B. PunKon
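Because an expectancy table needs only tabulation and percentages, a test user could prepare one for his own group with a few lines of code. The sketch below is a minimal illustration, not the procedure used for Table 1: the function name `expectancy_table`, the score bands, the cutoffs, and the eight sample records are all invented for the example.

```python
from collections import defaultdict

def expectancy_table(records, bands, cutoffs):
    """records: (test score, later criterion value) pairs.
    bands: inclusive (low, high) score ranges; cutoffs: criterion thresholds.
    Returns {band: {cutoff: percent of the band reaching at least cutoff}}."""
    grouped = defaultdict(list)
    for score, criterion in records:
        for low, high in bands:
            if low <= score <= high:
                grouped[(low, high)].append(criterion)
                break
    table = {}
    for band in bands:
        outcomes = grouped.get(band, [])
        if not outcomes:
            continue  # no cases observed in this band, so no estimate
        n = len(outcomes)
        table[band] = {c: round(100 * sum(o >= c for o in outcomes) / n)
                       for c in cutoffs}
    return table

# Hypothetical records: (aptitude raw score, first-semester point-hour ratio)
records = [(120, 3.2), (118, 2.4), (115, 1.9), (95, 2.1),
           (97, 1.4), (60, 1.6), (58, 0.9), (62, 2.0)]
bands = [(100, 150), (50, 99)]
cutoffs = [1.5, 2.0, 3.0]

table = expectancy_table(records, bands, cutoffs)
for band in bands:
    print(band, table[band])
```

Each row of the result reads like a row of Table 1: the percentage of persons in that score band who later reached at least each criterion level. With so few cases the percentages are unstable; a real table would rest on hundreds of records.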

10. Expectancy tables prepared for local use are clearly meaningful. Can
expectancy tables profitably be included in test manuals, in view of the fact that
probability of success on a job depends on local conditions?

11. Interpret this information about scores of a prospective aircraft armorer, prior
to training: mechanical aptitude, 120; trade information, 140; nut-and-bolt
test, 100.


[Figure 10: three bar charts headed "Probability of Success as a Function of
Score" for score levels 140, 120, 100, 80, and 60. Panels: Mechanical
Aptitude (r = .51), Trade Information (r = .47), and Nut-and-Bolt Manual
Dexterity (r = .26).]

FIG. 10. Expectancy charts showing probability of earning at least an average grade in
training as an aircraft armorer (Personnel Classification Tests, 1946).





Percentile Scores 

The easiest way to make comparisons is to rank the scores from highest to
lowest. Reporting that a person stands third out of forty conveniently states
his position relative to others. Ranks, however, depend on the number of
persons in the group. If we wish to examine change in standing from one
occasion to another we have difficulty because the size of the group changes.
To avoid this difficulty, ranks are changed to percentile scores (also called
percentile ranks and centile ranks). A percentile score is the rank expressed
in percentage terms. A person’s percentile score tells what proportion of the
group falls below him. Suppose there are 40 persons, 27 superior to A and 12
poorer. Then we arbitrarily divide case A (and all persons tied with him,
if any) between the two groups, saying that 27½ cases are above him and
12½ cases below. Since 12½ is 31 percent of 40, his percentile score is 31.

COMPUTING GUIDE 1

1. Begin with the raw scores (these are scores of 75 ninth-grade boys on
Bennett Form AA).

    37 43 27 44 27 27 26 31 35 42 50
    35 43 36 26 50 47 36 26 32 32 38
    36 21 24 40 39 35 38 36 38 21 17
    26 35 22 18 50 30 38 50 16 45  8
    34 26 34 28 41 27 39 41 30 23 33
    22 31 36 40 54 24 22  8 33 42 41
    41 31 34 36 32 20 22 34 41

2. Identify the highest score and the lowest score. If there is a wide range,
choose a class interval of 1, 2, 5, 10, 20, etc., and divide the range into
classes of equal width. Fifteen or more classes are desirable.

    Highest score = 54; lowest score = 8; range = 46.
    Class interval of 5 will be used. (A smaller interval, such as 2, would
    be preferable but would be inconvenient in this computing guide.)

3. Tally the number of cases with each score.

4. Write the number of tallies in the Frequency (f) column. Add this column
to get N, the number of cases.

5. Begin at the bottom of the column and add frequencies one at a time to
determine the cumulative frequency, the number of cases below each division
point.

6. Divide the cumulative frequencies by N to determine cumulative
percentages.

                                     Fre-      Cumulative   Cumulative
    Scores     Tallies               quency    Frequency*   Percent**
                                     (f)

    50-54      /////                   5          75          100
    45-49      //                      2          70           93
    40-44      ///// ///// //         12          68           91
    35-39      ///// ///// ///// //   17          56           75
    30-34      ///// ///// ////       14          39           52
    25-29      ///// /////            10          25           33
    20-24      ///// /////            10          15           20
    15-19      ///                     3           5            7
    10-14                              0           2            3
     5- 9      //                      2           2            3
    (below 4.5)                                    0            0

                                   N = 75

    * Each entry is the number of cases below the upper limit of the
      interval: 5 cases fall below 19.5, 15 below 24.5, etc.
    ** 20 percent of the cases fall below 24.5; 20 is the cumulative
      percentage corresponding to a raw score of 24.5.
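The rule just described (count the cases below a person, plus half of the cases tied at his score, and express the total as a percentage of the group) can be sketched as a small function. The name `percentile_score` and the three illustrative score values are assumptions made for this example.

```python
def percentile_score(score, group):
    """Percent of the group below `score`, counting half of the cases
    tied at that score (the person himself included) as below."""
    below = sum(s < score for s in group)
    tied = sum(s == score for s in group)
    return 100 * (below + tied / 2) / len(group)

# 40 persons: 12 poorer than A, A himself, and 27 superior.
# The score values 10, 20, 30 are arbitrary; only the ordering matters.
group = [10] * 12 + [20] + [30] * 27
print(round(percentile_score(20, group)))
```

Here 12½ of 40 cases are counted below A, which gives the percentile score of 31 quoted in the text.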

By this method of computation, the person exactly in the middle of the
group is at the 50th percentile. The 50th percentile is called the median.
The median indicates the performance of the most typical member of the
group.

A graphic procedure is often used to compute percentiles. The graphic
method disregards irregularities in the distribution of scores in a particular
sample and therefore gives a better estimate of what may be expected when
further groups are tested. Computing Guide 1 demonstrates this method,
using a set of Bennett TMC scores for a ninth-grade class.
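The tabulation steps of Computing Guide 1 (class intervals, frequencies, cumulative frequencies, cumulative percentages) can be reproduced mechanically. The sketch below uses the 75 scores listed in the guide; the function name `cumulative_table` and the choice to start the lowest interval at 5 are assumptions that simply match the guide's layout. It covers only the tabulation, not the graphic smoothing.

```python
scores = [
    37, 43, 27, 44, 27, 27, 26, 31, 35, 42, 50,
    35, 43, 36, 26, 50, 47, 36, 26, 32, 32, 38,
    36, 21, 24, 40, 39, 35, 38, 36, 38, 21, 17,
    26, 35, 22, 18, 50, 30, 38, 50, 16, 45, 8,
    34, 26, 34, 28, 41, 27, 39, 41, 30, 23, 33,
    22, 31, 36, 40, 54, 24, 22, 8, 33, 42, 41,
    41, 31, 34, 36, 32, 20, 22, 34, 41,
]

def cumulative_table(scores, lowest, width):
    """Rows of (low, high, f, cumulative f, cumulative percent), lowest
    interval first. Cumulative values count cases below the upper limit
    of each interval (for example, below 24.5 for the 20-24 interval)."""
    n = len(scores)
    rows, cumulative, low = [], 0, lowest
    while low <= max(scores):
        high = low + width - 1
        f = sum(low <= s <= high for s in scores)
        cumulative += f
        rows.append((low, high, f, cumulative, round(100 * cumulative / n)))
        low += width
    return rows

rows = cumulative_table(scores, lowest=5, width=5)
for low, high, f, cum, pct in reversed(rows):  # highest interval first
    print(f"{low:2d}-{high:2d}  f={f:2d}  cum={cum:2d}  pct={pct:3d}")
```

Plotting the cumulative percentages against the upper limits of the intervals and drawing a smooth curve through the points gives the graphic estimate of percentile equivalents that the guide describes.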

Transforming raw scores to percentile scores changes the shape of the
distribution. In Figure 11, raw scores for the same ninth-grade class have been
plotted. The distribution is high at the center and tapers away at each end.
When each score is changed to a percentile equivalent in the lower part of
the figure, the distribution is nearly rectangular. With larger samples the
percentile distribution becomes almost perfectly rectangular. Persons near
the middle of the raw-score scale are spread apart; persons at the end are
squeezed together. It is important to realize that a rather large percentile
difference near the median often represents a small difference in
performance. Conversely, the difference between the 90th and the 99th percentiles,
though it looks small on this scale, may be as great as the difference between
a five-minute and a four-minute mile.

[Figure 11: two frequency histograms for the same ninth-grade class, raw
scores in the upper part and percentile equivalents in the lower part.]

FIG. 11. Raw score distribution and distribution of percentile equivalents for same group.

Averaging two percentile scores gives a result different from what would
be obtained if the average of the corresponding raw scores were changed to
a percentile score. Raw scores of 14 and 22 average 18, which has a
percentile equivalent of 6. The percentile equivalent of 14 is 4, and that of 22 is 13.
If we had made the mistake of averaging percentiles, our answer would have
been 8½ instead of 6. While one rarely makes a large error by averaging
percentiles, the cumulation of such errors can distort statistical findings, and
therefore percentiles should not be averaged. The median is the proper
measure of central tendency in analyzing percentile scores.

Percentile scores must not be compared unless the groups on which they
are based are comparable. Published norms for tests may be based on
dissimilar groups. Purdue Pegboard norms are based on industrial trainees;
Form AA of the TMC gives norms for engineering freshmen. A person at the
75th percentile on both tests according to these norms does not have equally
high standing in these two abilities. For the TMC, separate tables for
industrial trainees are available, and we find that a score which is at the 73rd
percentile among freshmen is at the 90th percentile among trainees.
Wherever percentiles are used, the norm group involved must be kept in mind.


TABLE 2. Bennett TMC Norms for Boys
in Grade 9

    Percentile     Score

        99           54
        95           47
        90           44
        85           41
        80           39
        75           38
        70           36
        65           35
        60           33
        55           32
        50           31
        45           30
        40           29
        35           27
        30           26
        25           23
        20           22
        15           20
        10           17
         5           14
         1            5

    Number of cases   833
    Mean              30.8
    s.d.              10.4

Source: Bennett, 1947


In the manual for the Bennett TMC, the user finds a collection of
percentile conversion tables permitting him to compare his subject with various
reference groups. One such table is reproduced here. This can be used
like the table prepared in Computing Guide 1, although the tables are
arranged differently.

A score of 39 falls at the 80th percentile in the norm table, even though it
was nearer the median (68th percentile) of the small class (Computing
Guide 1). The median of the small class (34) is higher than the median of
the standardization group; the class is evidently a superior group. This
demonstrates the value of carefully collected norms. A person who is just average
in mechanical comprehension would not be especially encouraged to choose
a mechanical field. The average student within this class, however, proves
to be superior (63rd percentile), compared to students generally. It is this
larger group with whom he will compete after high school.
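Reading a percentile from a norm table such as Table 2 amounts to a lookup, with some rule for raw scores that fall between tabled values. The sketch below uses linear interpolation between neighboring entries; that rule, the name `percentile_from_norms`, and the treatment of scores beyond the ends of the table are assumptions made for illustration, since the text does not specify how in-between scores are handled.

```python
# (raw score, percentile) pairs from Table 2, lowest first
NORMS = [(5, 1), (14, 5), (17, 10), (20, 15), (22, 20), (23, 25), (26, 30),
         (27, 35), (29, 40), (30, 45), (31, 50), (32, 55), (33, 60), (35, 65),
         (36, 70), (38, 75), (39, 80), (41, 85), (44, 90), (47, 95), (54, 99)]

def percentile_from_norms(raw, norms=NORMS):
    """Percentile for a raw score, interpolating linearly between tabled
    scores (an assumed rule). Scores beyond the table take the end values."""
    if raw <= norms[0][0]:
        return norms[0][1]
    if raw >= norms[-1][0]:
        return norms[-1][1]
    for (s0, p0), (s1, p1) in zip(norms, norms[1:]):
        if s0 <= raw <= s1:
            return p0 + (p1 - p0) * (raw - s0) / (s1 - s0)

print(percentile_from_norms(39))  # tabled directly
print(percentile_from_norms(34))  # falls between the entries for 33 and 35
```

A raw score of 34, the median of the small class, lands between the 60th and 65th percentiles of the norm group, in line with the 63rd percentile cited in the text.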

12. Interpret the following record of ability test scores for one person, where all
scores are percentile scores based on a random sample of adults: Verbal, 54;
Number, 46; Spatial, 87; Reasoning, 40.

13. Estimate Alfred's percentile score in each of the four tests he took (question 8,
p. 72).

14. Why does the table of Bennett norms begin at 1 and stop at 99, instead of
ranging from 0 to 100? What percentile score corresponds to a raw score of
60 (perfect)?

15. Scores usually change when a test is repeated, because of chance errors of
measurement. If each of the following persons changes two points up or down
in raw score on the TMC, how much would his percentile score change?

a. A person with a percentile score of 55 on the first test.

b. A person at the 95th percentile on the first test.

16. The scores below are the times, in seconds, required by a group of persons to
perform an easy Block Design problem. Prepare a table of percentile
equivalents for this group:

52 34 41 42 46 45 27 48 35 35 38 29 54 36 33 30

48 39 44 36 36 34 51 40 30 33 37 41 56 32 48 35

37 28 28 45 31 39 31 27 35 36 34 42 38 33 33 31

39 28 36 33 37 36 34 54 34 32 33 38

17. According to the table prepared in problem 16, how much difference in
seconds does a difference of 10 percentile points represent?

Standard Scores 

Mean and Standard Deviation. The second common way to summarize the
performance of a group is to use the mean and standard deviation. The
mean (M) is the arithmetical average obtained when we add all scores
and divide by the number of scores. The standard deviation (s.d., or s) is a
measure of the spread of scores. The variation of two sets of scores may be
different even though the averages are the same. Figure 12 shows the
smoothed distribution of scores of two classes taking the same test. Even
though the groups are similar in mean ability, the distributions are not at all
alike. Group B contains far more very superior and inferior cases and
therefore has a larger standard deviation.

One method of computing the mean and standard deviation is outlined in
Computing Guide 2. The complicated formula makes it hard to see just what
the standard deviation means, but in effect it is an average of the deviations
of persons’ scores from the group mean. We might measure the spread of
scores by finding how far each person is from the mean and averaging
(ignoring the direction of deviation). For mathematical reasons, the
standard deviation formula takes the average of the squares of the deviations
rather than of the deviations directly, and then takes the square root of the
result.

[Figure 12: two smoothed frequency curves, one over "Scores of Group A"
(axis marked 50 to 90) and one over "Scores of Group B" (axis marked 40 to
100).]

FIG. 12. Distributions of scores of two classes on the same test.

1. Begin with the raw scores (these are scores of 75 ninth-grade boys on
Bennett Form AA).

    37 43 27 44 27 27 26 31 35 42 50
    35 43 36 26 50 47 36 26 32 32 38
    36 21 24 40 39 35 38 36 38 21 17
    26 35 22 18 50 30 38 50 16 45  8
    34 26 34 28 41 27 39 41 30 23 33
    22 31 36 40 54 24 22  8 33 42 41
    41 31 34 36 32 20 22 34 41

2. Identify the highest score and the lowest score. If there is a wide range,
choose a class interval of 1, 2, 5, 10, 20, etc., and divide the range into
classes of equal width. Fifteen or more classes are desirable.

    Highest score = 54; lowest score = 8; range = 46.
    Class interval of 5 will be used. (A smaller interval, such as 2, would
    be preferable but would be inconvenient in this computing guide.)

3. Tally the number of cases with each score.

4. Write the number of tallies in the Frequency (f) column. Add this column
to get N, the number of cases.

5. Select any interval, usually near the middle of the distribution. Call this
the arbitrary origin. (Here, the 30-34 interval is used.) Determine the
deviation d of each interval from the arbitrary origin.

6. Multiply in each row the entries in the f and d columns, and enter in the
fd column.

7. Multiply the entries in the d and fd columns, and enter in the fd² column.
Add the fd and fd² columns. (Σ is a symbol meaning "sum of.")

                                     Fre-
    Scores     Tallies               quency     d      fd     fd²
                                     (f)

    50-54      /////                   5        4      20     80
    45-49      //                      2        3       6     18
    40-44      ///// ///// //         12        2      24     48
    35-39      ///// ///// ///// //   17        1      17     17
    30-34      ///// ///// ////       14        0       0      0
    25-29      ///// /////            10       -1     -10     10
    20-24      ///// /////            10       -2     -20     40
    15-19      ///                     3       -3      -9     27
    10-14                              0       -4       0      0
     5- 9      //                      2       -5     -10     50

                                      75              +18    290
                                       N              Σfd    Σfd²

8. Substitute in the following formulas:

\[ c \ (\text{correction}) = \frac{\sum fd}{N} \]

\[ M \ (\text{mean}) = A.O. + I \times c \]

\[ s.d. = I \times \sqrt{\frac{\sum fd^2 - N c^2}{N - 1}} \]

A.O. is the midpoint of the score-interval selected as arbitrary origin, and
I is the width of the interval.

\[ c = \frac{18}{75} = .24 \]

\[ M = 32.0 + 5(0.24) = 32.0 + 1.20 = 33.20 \]

\[ s.d. = 5 \times \sqrt{\frac{290 - 75(0.24)^2}{74}}
        = 5 \times \sqrt{\frac{290 - 75(.058)}{74}}
        = 5 \times \sqrt{\frac{290 - 4.3}{74}}
        = 5 \times \sqrt{\frac{285.7}{74}} \]

\[ s.d. = 5 \times \sqrt{3.86} = 5(1.96) = 9.80 \]

COMPUTING GUIDE 2. DETERMINING THE MEAN AND STANDARD DEVIATION

The standard deviation indicates how much variation there is within a
group. In much statistical analysis the square of the standard deviation,
called the variance of the distribution, is used as an index of variation.
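The arbitrary-origin method of Computing Guide 2 is a hand-computation shortcut; it should give the same mean and standard deviation as applying the definitions directly to the grouped data. The sketch below checks that under stated assumptions: every case is treated as sitting at the midpoint of its interval, the divisor is N - 1 as in the guide, and the variable names are introduced here for illustration.

```python
from math import sqrt

# Interval midpoint -> frequency, from Computing Guide 2
freqs = {52: 5, 47: 2, 42: 12, 37: 17, 32: 14,
         27: 10, 22: 10, 17: 3, 12: 0, 7: 2}
WIDTH = 5      # I, the width of the class interval
ORIGIN = 32    # A.O., the midpoint of the 30-34 interval

n = sum(freqs.values())

# Coded (arbitrary-origin) method: d is the deviation in interval units
sum_fd = sum(f * (mid - ORIGIN) // WIDTH for mid, f in freqs.items())
sum_fd2 = sum(f * ((mid - ORIGIN) // WIDTH) ** 2 for mid, f in freqs.items())
c = sum_fd / n
mean_coded = ORIGIN + WIDTH * c
sd_coded = WIDTH * sqrt((sum_fd2 - n * c ** 2) / (n - 1))

# Direct method: average squared deviation from the mean, then square root
mean_direct = sum(f * mid for mid, f in freqs.items()) / n
sd_direct = sqrt(sum(f * (mid - mean_direct) ** 2
                     for mid, f in freqs.items()) / (n - 1))

print(sum_fd, sum_fd2, round(mean_coded, 2), round(sd_coded, 2))
print(round(mean_direct, 2), round(sd_direct, 2), round(sd_direct ** 2, 1))
```

Both routes give a mean of 33.2. The standard deviation comes out near 9.82 rather than the guide's 9.80 because the guide rounds the square root to 1.96 before multiplying by 5. The last value printed is the variance.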

18. a. Compute the mean and standard deviation for the Block Design scores
given in problem 16, p. 78.

b. How does the mean compare with the median computed previously?

c. What is the approximate percentile rank for a score 2 s.d. above the mean
in this distribution?

Conversion Scales. We can replace the person’s raw score with a derived
score showing his position relative to the mean. To say how far above or
below he is, we use the standard deviation as a unit. We can say that one
person is 2.5 s.d. above the mean and another is at -1 s.d. (i.e., is 1 s.d. below the
mean). From Computing Guide 2, we see that a score of 43 is about 1 s.d.
above the mean, for example. Derived scores based on standard deviation
units are called standard scores.

Computing Guide 3 shows how to convert raw scores to a standard-score
scale with a mean of zero and each s.d. above the mean counted as one unit.
One can also convert scores to the “T-score” system which sets the mean at
50 (to avoid negative scores) and each s.d. equal to 10 points. As Figure 13
shows, changing raw scores into standard scores does not alter the form of
the distribution (except for slight changes due to regrouping).
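The two conversions just described are simple linear rescalings, which is why they leave the form of the distribution unchanged. A minimal sketch follows, using the mean (33.2) and s.d. (9.80) found in Computing Guide 2; the function names are chosen here for illustration.

```python
def z_score(raw, mean, sd):
    """Deviation from the mean, expressed in standard deviation units."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """Standard score rescaled to a mean of 50 and an s.d. of 10."""
    return 50 + 10 * z_score(raw, mean, sd)

MEAN, SD = 33.2, 9.80  # values from Computing Guide 2

for raw in (50, 43, 25):
    print(raw, round(z_score(raw, MEAN, SD), 1), round(t_score(raw, MEAN, SD)))
```

A raw score of 43 comes out at about 1 s.d. above the mean, matching the remark in the preceding paragraph. Rounding the T score only at the end can occasionally differ by a point from rounding z first, as a hand calculation would.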

Whereas the Bennett TMC presents norms in percentile form, the
Wechsler Block Design norms are in standard-score form (called "scaled
scores” by Wechsler). As an example, Table 3 gives the norms for people
aged 20-24. The range of converted scores is from 0 to 19, because
Wechsler chose to set the mean equal to a standard score of 10, and to count
each s.d. above or below the mean as 3 standard-score points.



[Figure 13: two frequency histograms for the same group, one plotted
against Raw Score and one against z Score.]

FIG. 13. Distributions of raw scores and standard scores for the same group.


1. Begin with the raw scores to be converted, and find the mean and s.d.
as in Computing Guide 2.

    For the data in Computing Guide 2, M = 33.2, s.d. = 9.80.

2. To obtain z scores, express each raw score as a deviation from the mean.
Divide by the s.d.

\[ z \text{ score} = \frac{\text{raw score} - \text{mean}}{\text{standard deviation}} \]

    For raw score 50:  z = (50 - 33.2) / 9.80 = 16.8 / 9.80 = 1.7
    For raw score 25:  z = (25 - 33.2) / 9.80 = -8.2 / 9.80 = -.8

3. To obtain T scores, multiply the z score by 10 and add to 50.

\[ T \text{ score} = 50 + \frac{10\,(\text{raw score} - \text{mean})}{\text{standard deviation}} \]

    For raw score 50, z = 1.7:   T = 50 + 10 (1.7) = 67
    For raw score 25, z = -.8:   T = 50 + 10 (-.8) = 42

COMPUTING GUIDE 3. CALCULATION OF STANDARD SCORES

TABLE 3. Standard-Score Equivalents of Raw Scores for the Block Design Test,
Ages 20-24

    Scaled score   0      1      2      3      4      5      6      7      8      9
    Raw score      0      1      2      3-8    9-12   13-16  17-20  21-24  25-28  29-31

    Scaled score   10     11     12     13     14     15     16     17     18     19
    Raw score      32-34  35-37  38-40  41-43  44-45  46-47  --     48     --     --

Source: Wechsler, 1955, p. 103

One can develop standard scores using other values for the mean and s.d.
Table 4 summarizes several standard-score systems now in use. While there
have been logical reasons for many of the variations, only confusion results
from so large a variety. It is now recommended (Technical
Recommendations, 1954) that test developers use the T-score system, with mean 50 and
s.d. 10. If it is desirable to keep converted scores below 10 so that they will
fit into one column of a standard punchcard for statistical operations, the
stanine scale (pronounced stay-nine) is recommended. The z conversion is
used in statistical and theoretical work, but is not often used by test
interpreters. The remaining systems may be expected to die out in time.
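All of these systems are the same linear conversion with different constants: a position expressed in s.d. units is multiplied by the system's s.d. and added to its mean. The sketch below illustrates this with the constants from Table 4; the dictionary keys and the function name `convert` are labels invented for the example, and the deviation IQ is shown with an s.d. of 15 only (some tests use 16).

```python
# (mean, s.d.) for each standard-score system, as listed in Table 4
SYSTEMS = {
    "z": (0, 1),
    "stanine": (5, 2),
    "wechsler_subtest": (10, 3),
    "T": (50, 10),
    "deviation_iq": (100, 15),
    "uses_aptitude": (100, 20),
}

def convert(z, system):
    """Standard score in the named system for a position z s.d. from the mean.
    Purely linear; in practice stanines are reported as whole numbers 1 to 9."""
    mean, sd = SYSTEMS[system]
    return mean + sd * z

for name in SYSTEMS:
    print(f"{name:17s} +1 s.d. -> {convert(1, name):5}   "
          f"-2 s.d. -> {convert(-2, name):5}")
```

The printed values reproduce the two example columns of Table 4 (scores corresponding to 1 s.d. above and 2 s.d. below the mean).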

19. The 1959 Stanford-Binet test fixes the mean IQ at 100 and the standard
deviation at 16. Express in T-score form the following IQs: 100, 84, 132, 150.

20. In Computing Guide 2, what standard score corresponds to a raw score of 40?
48? 4?

21. Draw a figure to show the relation between raw scores and T scores in
Computing Guide 3.

Smoothed Score Distributions. The frequency distribution shown at the top
of Figures 11 and 13 is jagged, but if more cases were added and smaller
class intervals were used it would become relatively smooth. We can
estimate the most likely shape of that distribution by drawing a smooth curve as
shown in the top portion of Figure 14. This distribution is not perfectly
symmetrical, but it tails off on both sides. Most tests yield distributions of
this general character. Since every distribution has its own shape, there is
some advantage in converting the score scale so that every test has the same
distribution form. The normal probability curve is used for this.

TABLE 4. Standard-Score Systems

    Mean     s.d.       Standard Score    Standard Score
    Set      Set        Corresponding     Corresponding
    Equal    Equal      to 1 s.d.         to 2 s.d.
    to       to         Above Mean        Below Mean        Name of System, Remarks

      0       1              1                -2            z scores; prominent in mathematical
                                                            theory of testing
      5       2              7                 1            Stanine scores
     10       3             13                 4            Used for Wechsler subtests
     50      10             60                30            T scores; most widely used system
    100      15 or 16      115 or 116         70 or 68      Deviation IQ used by many mental
                                                            tests
    100      20            120                60            Used for aptitude tests of U.S.
                                                            Employment Service

[Figure 14: a smoothed frequency curve of raw scores above, and the
corresponding curve plotted against Normalized z Score and Normalized
T Score (axis marked 20 to 80) below.]

FIG. 14. Smoothed distribution of raw scores and distribution of normalized scores.

The Normal Probability Curve. The normal curve (Figure 15) is a smooth,
symmetric frequency curve which has important mathematical properties.
The standard deviation is the distance from the mean to the "point of
inflection” on the shoulder of the normal curve. This, as shown in Figure 15, is the
point which separates the convex, hill-like portion from the concave tail.
The normal curve is important in the theory of probability and is used in
statistical analysis to determine whether a particular experimental result may
be a chance occurrence.

FIG. 15. Percentage of cases falling in each portion of a normal
distribution.

Many biological measures such as heights of American men fall into a
nearly normal distribution, perhaps because chance combinations of
chromosomes determine the variable. In psychological tests also, it is very common
to obtain normal distributions of scores. Early investigators thought it a
natural law that abilities are normally distributed. It is now realized that
such a statement is meaningless, since the shape of the distribution depends
on the scale of measurement. The distributions of actual test scores depend
on the way the test is constructed. By selecting items suitably, we can change
the score distributions to U-shaped, flat, skewed to one side, etc. (F. M. Lord,
1952; Cronbach and Warrington, 1952). The use of normal curves in test
scaling is therefore merely a convenience and is not based on any “normal
distribution of behavior” in nature.

If we slice a normal distribution into bands one standard deviation wide,
a fixed percentage of the cases always falls in each band. As Figure 15
shows, 34 percent of the cases fall between the mean and +1 s.d. In the
next interval are 14 percent, and in the third 2 percent. Since 99.7 percent of
the cases fall between +3 s.d. and -3 s.d., the whole range of test scores
is somewhere near 6 standard deviations (less, when the group is small).
These facts are handy for interpreting standard scores and for roughly
reconstructing the score distribution if the mean and s.d. are known.

Whenever we have, or assume, a normal distribution, we can quickly
convert standard scores to percentile scores, and vice versa. Below the mean
(z score of zero, or T score of 50) are 50 percent of the cases. Below +1 s.d.
are 50 + 34 or 84 percent of the cases; hence a T score of 60 equals a
percentile score of 84.
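The band percentages and the standard-score-to-percentile conversion can both be computed from the normal curve's cumulative distribution. The sketch below uses the error function from Python's standard `math` module; the function names are chosen here for illustration, and the results apply only when a normal distribution is assumed.

```python
from math import erf, sqrt

def percent_below(z):
    """Percent of cases below a point z s.d. from the mean of a normal curve."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

def percent_between(z_low, z_high):
    """Percent of cases falling in the band from z_low to z_high."""
    return percent_below(z_high) - percent_below(z_low)

def percentile_from_t(t):
    """Percentile score for a T score (mean 50, s.d. 10), assuming normality."""
    return percent_below((t - 50) / 10)

for low, high in ((0, 1), (1, 2), (2, 3), (-3, 3)):
    print(low, high, round(percent_between(low, high), 1))

print(round(percentile_from_t(60)))
```

Rounded to whole numbers, the first three bands contain 34, 14, and 2 percent of the cases, and a T score of 60 corresponds to a percentile score of 84, as stated above.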

22. What percentile rank corresponds to a score of 2 s.d. above the mean? To a
score 1 s.d. below the mean?

23. In a normal distribution, what is the relation of the mean and median?

24. Assuming that scores are normally distributed on a test where the mean is 60
and s.d. is 8, interpret the following scores: Sara, 64; Harriet, 68; Charles, 87;
Bob, 48.

25. Using Figure 15, interpret each of the following in percentile terms: a z score
of 3.0; a z score of -2.0; a T score of 40; a T score of 65.

Normalized Scores. Scores are somewhat easier to interpret if all tests
are reduced to a scale having a known distribution. For this purpose, testers
most commonly employ normalized standard scores. These scores are obtained
by stretching a distribution to make it nearly normal, and then changing
it to standard-score form. One procedure which accomplishes this result
is to compute percentiles by the method of Computing Guide 1, and then
to read from Table 5 the standard score corresponding to that percentile
value.

TABLE 5. Relations Between Standard Scores and Percentile Scores, When Raw
Scores Are Normally Distributed

Distance                            Percent                             Distance
from Mean                           of Cases                            from Mean
in s.d.                Percentile   in "Tail"    Percentile             in s.d.
(z Score)   T Score    Score        of Curve     Score        T Score   (z Score)

   3.0        80        99.9          0.1          0.1          20        -3.0
   2.9        79        99.8          0.2          0.2          21        -2.9
   2.8        78        99.7          0.3          0.3          22        -2.8
   2.7        77        99.6          0.4          0.4          23        -2.7
   2.6        76        99.5          0.5          0.5          24        -2.6
   2.5        75        99.4          0.6          0.6          25        -2.5
   2.4        74        99.2          0.8          0.8          26        -2.4
   2.3        73        99            1            1            27        -2.3
   2.2        72        99            1            1            28        -2.2
   2.1        71        98            2            2            29        -2.1
   2.0        70        98            2            2            30        -2.0
   1.9        69        97            3            3            31        -1.9
   1.8        68        96            4            4            32        -1.8
   1.7        67        96            4            4            33        -1.7
   1.6        66        95            5            5            34        -1.6
   1.5        65        93            7            7            35        -1.5
   1.4        64        92            8            8            36        -1.4
   1.3        63        90           10           10            37        -1.3
   1.2        62        88           12           12            38        -1.2
   1.1        61        86           14           14            39        -1.1
   1.0        60        84           16           16            40        -1.0
   0.9        59        82           18           18            41        -0.9
   0.8        58        79           21           21            42        -0.8
   0.7        57        76           24           24            43        -0.7
   0.6        56        73           27           27            44        -0.6
   0.5        55        69           31           31            45        -0.5
   0.4        54        66           34           34            46        -0.4
   0.3        53        62           38           38            47        -0.3
   0.2        52        58           42           42            48        -0.2
   0.1        51        54           46           46            49        -0.1
   0.0        50        50           50           50            50         0.0
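
The table lookup in this procedure amounts to inverting the normal curve. The sketch below is a minimal illustration, not the computing guide's own routine: it assumes the percentile has already been computed and that T scores are rounded to whole numbers. The function name `normalized_t` is hypothetical.

```python
from statistics import NormalDist


def normalized_t(percentile):
    """Normalized T score (mean 50, s.d. 10) for a percentile score.

    The percentile must lie strictly between 0 and 100, since the normal
    curve has no finite score at either extreme.
    """
    if not 0 < percentile < 100:
        raise ValueError("percentile must be between 0 and 100, exclusive")
    z = NormalDist().inv_cdf(percentile / 100)
    return round(50 + 10 * z)


if __name__ == "__main__":
    for pct in (16, 50, 84, 95):
        print(pct, normalized_t(pct))
```

For percentiles of 16, 50, 84, and 95 this gives T scores of 40, 50, 60, and 66, matching the corresponding entries of Table 5.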

In our illustrative distribution of ninth-grade TMC scores, 95 is the percentile
equivalent of a raw score of 50, and Table 5 indicates that the corresponding
normalized T score is 66. This compares with a T score of 67 (not
normalized) obtained in Computing Guide 3. Such small changes, stretching
out the scale at the upper end and compressing it at the lower end, produce
the distribution shown at the bottom of Figure 14. This distribution is more
symmetric than the raw-score distribution. If more cases had been used, the
smoothed distribution would be completely normal.

Profiles

With derived scores, it is possible to compare performance on one test
with that on another. This is illustrated in the Differential Aptitude Tests, a
set of eight tests for different abilities. One section is a new form of the Bennett
TMC. The tests vary in length and in difficulty, so that from the raw
scores alone one cannot judge the person’s greatest ability. After raw scores
are changed to percentiles or normalized standard scores, one can plot a
profile showing his relative standing in all fields. The profile shown in Figure
16 is that of a high-school junior; the norms for junior boys were used to convert
his scores. Robert is highly superior in the various reasoning tests, and is
almost equally outstanding in all of them. His last three scores are quite poor.

FIG. 16. Profile of Robert Finchley on the Differential Aptitude Tests (Bennett, Seashore, and Wesman,
1951). Scales shown: Verbal, Numerical, Abstract, Space, Mechanical, Clerical, Spelling, Sentences, VR + NA.

Comparison of Systems 

Since the manual sometimes offers more than one type of conversion table
and since the user often has to develop local norms for tests, he needs some
basis for deciding which system of scores is preferable.

The percentile score has these advantages: it is readily understood, which
makes it especially satisfactory for reporting data to persons without statistical
training; it is easily computed; it may be interpreted exactly even
when the distribution of test scores is nonnormal. The disadvantages of the
percentile score are these: it magnifies small differences in score near the
mean which may not be significant, and it reduces the apparent size of large
differences in score near the tails of the distribution; it may not be used in
many statistical computations.
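
The first disadvantage can be made concrete with a small calculation. The sketch below assumes normally distributed scores and compares the percentile gap produced by the same half-s.d. difference at two places on the scale. The chosen z values are illustrative only.

```python
from statistics import NormalDist


def percentile_gap(z_low, z_high):
    """Difference in percentile score between two z scores, assuming normality."""
    unit = NormalDist()
    return 100 * (unit.cdf(z_high) - unit.cdf(z_low))


# The same raw-score difference (half a standard deviation) in two regions.
near_mean = percentile_gap(0.0, 0.5)
in_tail = percentile_gap(2.0, 2.5)

if __name__ == "__main__":
    print(f"near the mean: {near_mean:.1f} percentile points")
    print(f"in the upper tail: {in_tail:.1f} percentile points")
```

Under these assumptions the half-s.d. step is worth about 19 percentile points at the mean but fewer than 2 in the upper tail, which is the distortion described above.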

The advantages of standard scores are as follows: Differences in standard
score are proportional to differences in raw score; use of standard scores in
averages and correlations gives the same result as would come from use of
the raw scores. The disadvantages are that standard scores cannot be interpreted
readily when distributions are skewed, and that untrained persons
generally find them difficult to understand. In general, statisticians prefer
standard scores while those who interpret tests directly to laymen prefer
percentiles.
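
The claim that standard scores give the same correlations as raw scores follows from their being a linear transformation. A minimal check is sketched below; the two score lists are invented for illustration, and the T scores are computed with the population s.d., which is an assumption of this sketch rather than a rule taken from the text.

```python
from math import sqrt
from statistics import mean, pstdev


def t_scores(raw):
    """Linear (non-normalized) T scores: mean 50, s.d. 10."""
    m, s = mean(raw), pstdev(raw)
    return [50 + 10 * (x - m) / s for x in raw]


def pearson(xs, ys):
    """Product-moment correlation computed directly from the scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical raw scores of five persons on two tests.
test_a = [12, 15, 19, 22, 30]
test_b = [40, 38, 47, 52, 61]

r_raw = pearson(test_a, test_b)
r_std = pearson(t_scores(test_a), t_scores(test_b))

if __name__ == "__main__":
    print(f"r from raw scores:      {r_raw:.4f}")
    print(f"r from standard scores: {r_std:.4f}")
```

The two correlations agree to within rounding error. The same would not hold for percentile scores, which rescale the intervals unevenly.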

Normalized standard scores have become increasingly popular and are
generally suitable. The normalized scores spread out cases in both tails of
the distribution and yet can readily be translated into percentiles. The DAT
profile form shown in Figure 16 illustrates typical current practice. One can
read off standard scores (normalized) when they are needed for statistical
comparisons but can talk to the subject in terms of percentiles.

26. A teacher wishes to convert scores on class examinations so that he can tell at
a glance how well a person is doing and can average all tests equally in the
final grade. Should he use raw, percentile, or standard scores?

27. In a certain college, all freshmen are given a “scholastic aptitude test.” The
results are to be mimeographed and confidential copies given to all professors.
Should the report use raw scores, standard scores, or percentiles?

28. The psychometrist gives a wide variety of tests to veterans needing counseling.
After each man has taken from four to eight tests, results are to be placed on a
standard report form so that performance on all tests can be compared by the
counselor, in conference with the veteran. What problems will be encountered
if all results are reported in percentiles? If all results are reported as standard
scores?


NORMS 

The test manual assists the user to interpret scores by presenting information
regarding “normal” performance. This information takes the form of one or
more tables. The user should have no difficulty in interpreting the information
provided in the manual, although every manual organizes its tables a
bit differently. For example, the Bennett test manual provides tables of percentile
equivalents (cf. Table 2) so that the user may compare an individual
with any one of the following groups: 833 ninth-grade boys, 370 tenth-grade
boys, 613 engineering-school freshmen, 1836 candidates for policeman and
fireman positions, 548 candidates for apprentice training, 145 candidates for
engineering positions, 16S7 workers in a paper factory, 226 trainees in an airplane
factory, and fifteen other groups. The DAT form has separate norms
for boys and girls, for each grade from 8 through 12. This battery is primarily
used in high-school guidance.

No detailed information about the Bennett norm groups is given; hence a
user of the test can only guess whether his situation resembles that of the
tenth grade or of the “engineering positions” in the table. These norms were
published in 1940. They may be contrasted with the more modern description
given in the manual for the DAT Mechanical Reasoning Test (Bennett
et al., 1959):

Over one hundred school systems from all major geographic areas
contributed to the normative sampling. In some cities the whole school
population in five grades (eight through twelve) was tested; in some,
all pupils in one to four grades were tested; in some larger cities, classes
in representative schools (as judged by the local research director)
were examined. A complete listing of the normative sample showing
the number of students in each grade in each community is [obtainable]. . . .
The total number of students included in the present norms
(1952) is over 47,000.

The states which contributed to the normative study . . . and the
number of communities in each were: Arizona, 1; California, 5; Colorado,
1; . . . West Virginia, 2. The testing of the normative sample
was done throughout the school year. It is appropriate to assume that
the norms represent mid-year performance.

Some testers attach too much importance to norms, either when they select
tests or when they interpret scores. Others, recognizing that norms are
helpful, are unduly impressed by the number of cases used in compiling the
norm tables. We shall see, however, that the size of the standardizing sample
alone does not indicate how satisfactory the norms are.

Norms are unimportant in many uses of tests, particularly when one intends
only to identify individual differences within a group. For example,
norms are of little use to the employment manager who wishes to hire the
brightest ten persons in a group of applicants. Norms are also of little value
where a critical score is used. If a personnel manager knows from actual trial
that persons with scores of 72 on test A make satisfactory punch-press operators,
it is not necessary for him to compare applicants with national norms.

In guidance and clinical work, it is extremely important to use norms in
interpreting scores. The person’s position relative to his group has to be
fixed as definitely as possible. A child who scores at the 20th percentile on
a test of readiness for first grade will have difficulties in school, but he is by
no means a rare exception. If our norms placed him at the 2nd percentile instead,
we would not expect him to fit into the regular program at all.

For many test interpretations, local norms are far more important than
large-group norms. Classes differ so much that a child who is at the 20th percentile
in the typical first grade would be at the 2nd percentile in some
other school which enrolls pupils from a select neighborhood. One example
of such school-to-school differences is given in Table 6.


TABLE 6. School-to-School Differences in Mechanical Reasoning Scores of Ninth-Grade Boys

                                               Approximate
City                  Name of School           Number of Cases    Mean     s.d.

Worcester, Mass.      Commerce                       60           28.3     12.7
Worcester, Mass.      Five other schools            190           29.4     11.9
St. Joseph, Mo.       Benton                         70           30.3     11.7
St. Joseph, Mo.       Lafayette                      70           35.8     10.7
St. Paul, Minn.       Wilson                         50           34.0     10.3
Independence, Mo.     Chrisman                      175           37.6     12.4

Source: Bennett et al., 1947.
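
To see what such differences mean for one pupil, the sketch below places a single raw score against each school's distribution. It assumes that scores within each school are normally distributed, which Table 6 does not assert, and the raw score of 35 is an arbitrary illustrative value. The means and standard deviations are those of Table 6.

```python
from statistics import NormalDist

# (mean, s.d.) for each school, taken from Table 6.
SCHOOLS = {
    "Commerce (Worcester)": (28.3, 12.7),
    "Five other schools (Worcester)": (29.4, 11.9),
    "Benton (St. Joseph)": (30.3, 11.7),
    "Lafayette (St. Joseph)": (35.8, 10.7),
    "Wilson (St. Paul)": (34.0, 10.3),
    "Chrisman (Independence)": (37.6, 12.4),
}


def local_percentile(raw, school_mean, school_sd):
    """Approximate percentile of a raw score within one school (normality assumed)."""
    return 100 * NormalDist(school_mean, school_sd).cdf(raw)


if __name__ == "__main__":
    raw_score = 35  # hypothetical pupil
    for name, (m, s) in SCHOOLS.items():
        print(f"{name:32s} {local_percentile(raw_score, m, s):5.1f}")
```

Under these assumptions the same raw score stands near the 70th percentile at Commerce but only near the 42nd at Chrisman.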


Norms, to be useful, must permit the tester to compare the subject with
his prospective companions and competitors. The manual for the Wechsler
intelligence scale gives norms based on adults in general. But a boy who is
above the average for his age, compared to people in general, may be below
average among college freshmen. If we wish to predict whether he can
succeed in college, we need Wechsler norms based on college students alone.
More than that, we need to know the norms for the particular college he
plans to attend.

Sections of the country, occupational groups, and schools vary widely in
America. One example of the geographical differences to be expected is
given in Table 7. The SSCQT was given for several years to young men
who might wish to attend college; those earning scores of 70 and over were
generally allowed to postpone their military service until completion of college.
The test, designed to be as fair a measure of scholastic aptitude as possible,
called for verbal and quantitative reasoning. There are large differences
among regions: in the Midwest and East the average registrant was at
a high enough level to be exempt from immediate draft, whereas in the
South only 40 percent of the registrants performed at this level.

TABLE 7. Geographical Differences in Selective Service College Qualification Test Scores

                         Percentage of Freshmen Registrants Scoring Below
Residence                   50       60       70       75       80

New England                  1        4       44       71       92
Middle Atlantic              1        3       40       69       92
East North Central           2        4       45       74       94
West North Central           1        5       44       73       93
South Atlantic               6       11       57       78       94
East South Central           7       15       66       85       97
West South Central           9       16       63       84       95
Mountain                     2        4       46       75       94
Pacific                      2        5       45       72       93

Source: Statistical Studies, 1955, p. 89.

Whenever he can, the test interpreter should prepare norms for the groups
with which he deals directly. A high-school counselor could profitably use
information about the score distribution for all boys in his high school, for
boys in the shop curriculum, for boys who later attend the local college, and
for workers in certain large local industries. He uses published norms because
it takes time and effort to accumulate local norms or because, as is
often the case, he cannot possibly accumulate local norms. A clinician, for
example, has no chance to prepare norms for a random sample of 60-year-old
men, yet he needs to compare the men he tests with a community average.

Except where the primary use of a test is to compare individuals
with their own local group, norms should be published at the time of release
of the test for operational use. Norms should refer to defined and
clearly described populations. These populations should be the groups
to whom users of the test will ordinarily wish to compare the persons
tested. If appreciable differences between groups exist (e.g., groups
differing in age, sex, amount of training, etc.), and if a person would
ordinarily be compared with a subgroup rather than with a random
sample of persons, then separate norm tables should be provided in the
manual for each group [Technical Recommendations, 1954].

These are official recommendations (see p. 101) regarding test norms. All
these principles have been violated at times in the past. Tests have been
published with no norms. Others have offered norms based on inadequate
samples, and often the samples are improperly described. The difficulties in
this aspect of test interpretation are pointed out forcibly in these remarks by
a test publisher (H. G. Seashore and J. H. Ricks, Jr., 1950):

Legitimate and illegitimate general norms abound in current test
manuals. People-in-general norms are legitimate only if they are based
upon careful field studies with appropriate controls of regional, socioeconomic,
educational, and other factors, and even then only if the
sampling is carefully described so that the test user may be fully aware
of its inevitable limitations and deficiencies. The millions entering the
armed forces during World War II provided the basis of some fairly
good norms on young adult men, though mainly on tests not available to
the public. The standardization of the Wechsler Intelligence Scale for
Children is a recent attempt to secure a representative smaller sample
of children aged 5-15 for setting up tables of intelligence quotients
which may be considered generalized norms for children. The earlier
work of Terman’s group to set up good national norms on a small, well-chosen
sample is well known. In the standardizing of some educational
achievement tests, nationwide samplings of children of each appropriate
grade or age and from different types of schools in all parts of the
country are sought in an effort to produce norms that are truly general
for a given span of grades or ages.

Unfortunately, many alleged general norms reported in test manuals
are not backed even by an honest effort to secure representative samples
of people-in-general. Even tens or hundreds of thousands of cases
can fall woefully short of defining people-in-general. Inspection of test
manuals will show (or would show if information about the norms were
given completely) that many such massed norms are merely collections
of all the scores that opportunity has permitted the author or publisher
to gather easily. Lumping together all the samples secured more by
chance than by plan makes for impressively large numbers; but while
seeming to simplify interpretation, the norms may dim or actually distort
the counseling, employment, or diagnostic significance of a score.

With or without a plan, everyone of course obtains data where and
how he can. Since the standardization of a test is always dependent on
the cooperation of educators, psychologists and personnel men, the foregoing
comments are not a plea for the rejection of available samples but
for their correct labeling. If a manual shows “general” norms for a vocabulary
test based on a sample two-thirds of which consists of women
office workers, one can properly raise his test-wise eyebrows. There is
no reason to accept such norms as a good generalization of adult (or
even of employed-adult) vocabulary. It is better to set up norms on the
occupationally homogeneous two-thirds of the group and frankly call
them norms on female office workers. Adding a few more miscellaneous
cases does not make the sample a truly general one.

As a rule, then, in reading a test manual we should reject as treacherous
any alleged national or general norms whose generality is not supported
by a clear, complete report on the sample of people they represent,
or norms which are obviously opportunistic accumulations of
samples weighted by their size according to chance rather than logic.

If the manual describes the norm sample adequately, the user can judge
the norms by these questions:

* Does the standard group consist of the sort of person with whom my
subjects should be compared?

* Is the sample representative of this group?

* Does the sample include enough cases?

* Is the sample appropriately subdivided?

Norms for any group must be a fair description of that group. A fair sample
is assured when the test maker takes an exactly random sample of the population
(e.g., of all American college freshmen). Since this is difficult, the test
maker usually tries to obtain a mixture of cases from all segments of the
population; for college students, he would draw on large and small colleges,
private and public, from all sections of the country. If any segment is too
heavily represented, the norms will be biased.

In a small sample, accidental inclusion of a few additional good or poor
cases will make the norms unrepresentative. In large samples, such variations
should cancel out. No fixed number of cases is required for dependable
norms. It is better to keep the sample strictly representative, and small, than
to accumulate large numbers of cases which may not be representative. The
most unsatisfactory norms are those based on whatever cases happen to be
conveniently available. A manual may report, “The norms are based on
scores of 2700 sophomores taking general psychology at four Western colleges.”
Norms such as these are useless unless the tester wishes to know how
his cases compare with sophomore psychology students at Western colleges.

Even when the norm sample is large and representative, it should ordinarily
be subdivided if important, clearly identifiable subgroups earn different
average scores. On the DAT Mechanical Reasoning Test, the Grade 9
median for boys is 32 while that for girls is only 19. As Wesman (1949, p. 227) points
out:

Counseling would be very different if one had only the single sex
scores [in the norm table]. For example, a boy with a Mechanical
Reasoning score of 40 (in Grade 10) would be close to the 75th percentile
on a combined distribution scale. With only that information,
the counselor would be compelled to consider him as having enough
ability to compete favorably in a curriculum or occupation requiring
mechanical understanding. If he entered such a curriculum, however,
his competition would be almost entirely male. Compared with boys
only, his score of 40 leaves him at the 50th percentile, a ranking not at
all superior.

A new method of developing test norms may become prominent shortly.
This method involves calibrating a new test (or a test needing new norms)
against another well-standardized test. This is similar to the procedure the
maker of an aneroid barometer uses when he places marks on its dial so that
readings agree with an accurate mercury barometer. The method has not
been useful in psychology because no instrument has had norms sufficiently
perfect to be taken as a standard for other tests. It has been proposed, however,
to administer an experimental set of tests of various abilities and personality
characteristics to a strictly representative sample of 500,000 high-school
students. The proposed sample would constitute perhaps 5 percent of
the entire age group (ignoring the rather small fraction not in school).

The developer of a new test for high-school ages could use these data as a
basis for standardizing. To do so, he would select whichever of the experimental
tests measures about the same thing as his test does. For example, a
new Form DD of the TMC could be calibrated against whatever test of
mechanical comprehension is in the experimental battery. Call this test X.
The test developer would apply both test X and Form DD to the same sample.
This sample should be reasonably representative of high-school students
(for example, it should not be restricted to boys in a technical high school),
but it can be fairly small and need not be exactly representative. The
equipercentile method is then used to establish what scores on test X and
Form DD represent the same level of ability. Scores falling at the same percentile
in the calibration sample are taken as equivalent. Suppose we have
the following information:


                 Percentile Rank in                          Raw Score on Form
                 National Sample      Percentile Rank in     DD Having Same
Raw Score on     for Grade 9          Calibration Sample     Percentile Rank in
Test X                                                       Calibration Sample

     80                98                    99                     60
     60                82                    88                     52
     40                63                    70                     43


One would conclude, for example, that a score of 60 on Form DD is equivalent
to a score of 80 on test X. We therefore would expect it to fall at the 98th
percentile if it were standardized on a national sample of ninth-graders. Similarly,
one can use the data for test X to establish norms for other grades,
students in technical curricula, girls in vocational courses, etc. Once
Form DD scores are matched to scores on test X, any norms, expectancy
tables, or other research on test X can be used to interpret Form DD. Certain
statistical corrections may be necessary, however, unless Form DD and
test X are highly correlated (Lindquist, 1951, pp. 750-760).
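
The matching step of the equipercentile method can be sketched in a few lines. The ten-person calibration sample below is invented so that the three equivalences in the table (60 to 80, 52 to 60, 43 to 40) come out as in the text; its percentile ranks are therefore not those of the table. The function names, and the rule of taking the lowest test X score that reaches the target rank, are assumptions of this sketch, and no smoothing or correction for imperfect correlation is attempted.

```python
from bisect import bisect_right


def percentile_rank(scores, value):
    """Percent of the calibration sample scoring at or below `value`."""
    ordered = sorted(scores)
    return 100 * bisect_right(ordered, value) / len(ordered)


def equivalent_score(new_scores, ref_scores, new_value):
    """Lowest reference-test score whose percentile rank in the calibration
    sample reaches the rank of `new_value` on the new test."""
    target = percentile_rank(new_scores, new_value)
    for candidate in sorted(set(ref_scores)):
        if percentile_rank(ref_scores, candidate) >= target:
            return candidate
    return max(ref_scores)


# Hypothetical calibration sample: ten students who took both tests.
FORM_DD = [30, 35, 43, 47, 52, 55, 58, 60, 62, 65]
TEST_X = [25, 33, 40, 48, 60, 66, 72, 80, 85, 90]

# National Grade 9 percentile ranks for test X, as tabled in the text.
NATIONAL_RANK_TEST_X = {80: 98, 60: 82, 40: 63}


def national_rank_for_form_dd(dd_score):
    """Expected national percentile rank of a Form DD score (tabled points only)."""
    x_score = equivalent_score(FORM_DD, TEST_X, dd_score)
    return NATIONAL_RANK_TEST_X[x_score]


if __name__ == "__main__":
    for dd in (60, 52, 43):
        print(dd, equivalent_score(FORM_DD, TEST_X, dd), national_rank_for_form_dd(dd))
```

With a real calibration sample one would interpolate between tabled scores rather than look up exact matches, so this lookup is meant only to show the chain from Form DD score to test X score to national percentile.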

Test norms become obsolete and need to be checked periodically. Research
on the Wechsler intelligence tests, for example, suggests that the
scores of adults are, on the average, higher than those for similar age groups
a decade ago. These changes may be attributed to an increasing level of education.

It is essential that norms be verified whenever a test is altered. Changes of
items or format can make the test easier or harder and can even alter its
meaning. The Crawford Structural Visualization test is made by cutting a
circular disk into nine pieces of irregular shape. The score is the time the
subject takes in fitting the pieces together. Originally this test was made of
heavy aluminum. After it had been in use for some years, the manufacturer
began to use wood instead of metal. Psychologists who applied both old and
new versions found that the mean time for the wooden test was 182 seconds,
whereas the original test had a mean of 140 seconds (J. W. Wilson and K. E.
Carpenter, 1948). The publisher had altered the test so that the published
norms were now meaningless; it was a serious error not to revise the manual.

29. It might be appropriate to compare a high-school girl's performance on the 
Mechanical Reasoning test with girls' norms, boys' norms, or the combined 
norms, depending on the decision to be made. Illustrate. 

30. How do you explain the geographical differences on the SSCQT? Is it sound 
national policy to encourage more students from one region to attend college 
than from another? 

31. Assuming a normal distribution, what standard score corresponds to a raw 
score of 40 in each of the schools in Table 6? 

32. For Form CC, the difficult version of TMC intended for use in engineering
schools, separate norms are reported for 148 engineering freshmen at Princeton,
and for four groups at Iowa State College: 325 engineering freshmen, 175
agricultural engineering freshmen, 121 sophomores in architectural engineering,
and all engineering seniors (108). It is reported that the subgroups of
senior engineers were so similar that separate norms were not required. How
adequate are these norms? What other tables, if any, would be desirable?

33. In a particular college which admits all high-school graduates who apply, the
median score of the freshman class is at the 65th percentile of the published
norms for freshmen for the Henmon-Nelson test. What factors might account
for this deviation?

34. A psychologist standardizes a primary intelligence test by testing every child
entering the first grade in San Francisco during a particular year.

a. For what purposes would these norms be valuable? 

b. Could equally satisfactory norms be obtained without testing every first- 
grader? 

c. In what way would these norms be biased, as a sample of all 6-year-old 
children in San Francisco? 

35. A “music aptitude” test measures such factors as tone discrimination. There is
evidence that scores are increased by musical training. If the test is to be used
for advising college freshmen whether to study music, what sort of cases should
be used to establish national norms?

36. How would you proceed to get an extremely representative sample of adult
men in Chicago to use as a standardizing group for a mental test? Assume that
you have sufficient research funds to pay each man $2.00 for taking the test.

37. Would local norms or national norms be most useful in interpreting each of
the following? 

a. A personality test given to indicate whether a prisoner is psychotic. 

b. An intelligence test given to an infant considered for adoption. 

c. A reading test given to determine if a high-school boy needs individual
remedial instruction.

Suggested Readings

Bauernfeind, Robert H. Are sex norms necessary? J. counsel. Psychol., 1956, 3,
57-63. Wesman, Alexander G. Separation of sex groups in test reporting.
J. educ. Psychol., 1949, 40, 223-229.

These authors argue as to whether one should present separate norms for
each sex in test manuals. The two articles should be compared to note the
points where the writers agree, and to determine why they come to different
conclusions.

Froehlich, Clifford P., & Darley, John G. Statistical methods of summarizing the
results of a single test or measuring device. Studying students. Chicago: Science
Research Associates, 1952. Pp. 12-38.

In extremely simple language the authors explain the computation of standard
deviations, percentiles, and other statistics, and also discuss desirable qualities
of test norms.

Lorge, Irving, & Thorndike, Robert L. Procedures for establishing norms. Technical
manual, Lorge-Thorndike Intelligence Tests. Boston: Houghton Mifflin, 1957.
Pp. 4-6.

This compressed summary describes the extensive research conducted to
establish norms for one of the prominent modern mental tests. The procedures
used to select communities for testing are unusually well designed. Results
show how norms depend on the socioeconomic level of the community tested.

Seashore, Harold G. Methods of expressing test scores. Test Serv. Bull., 1955,
No. 45. (Available on request from the Psychological Corporation.)

A dozen scales for reporting test scores are compared, including stanines,
percentiles, College Board scores, and so on.

Traxler, Arthur E. Administering and scoring the objective test. In E. F. Lindquist
(ed.), Educational measurement. Washington: American Council on Education,
1951. Pp. 329-416.

Besides describing, with excellent illustrations, all the major techniques used
for efficient scoring of tests both by hand and by machine, the chapter discusses
procedures for assuring that standard instructions are followed.

Traxler, Arthur E. The IBM test scoring machine: an evaluation. Proceedings,
1953 Invitational Conference on Testing Problems. Princeton: Educational Testing
Service, 1954. Pp. 139-146.

Traxler discusses the history and contribution of these machines, raising incidental
questions as to the possible harm done by forcing all large-scale testing
into the mold of five-choice multiple-choice items which fit the machine efficiently.
Gives a realistic picture of the practical limitations of automatic test
scoring. Other papers in the same symposium describe superdevices some of
which are still in the drawing-board stage.

NEED FOR CRITICAL EVALUATION OF TESTS

WHEN a teacher investigates the mental ability of his pupils, he looks for
the best mental test available. An industrial psychologist selecting workers
for a factory wishes to try the best possible test of mental ability. The clinical
psychologist studying a child who may be feeble-minded needs the mental
test which will give the most accurate results. Each of these users therefore
asks, “What is the best test of mental ability?” But the test which best serves
one of these testers is probably not the best for either of the others.

The purchaser of tests has a confusing problem. He is faced with long tests
and short tests, famous tests and unfamiliar tests, old tests and new tests, ordinary
tests and novel tests. The catalog of a leading test distributor offers 33
tests of general mental ability and 19 tests of personality. Each of these tests
was produced by a psychologist who thinks his test is in some way superior
to the others on the market. He is frequently correct.

Different tests have different virtues; no one test in any field is “the best”
for all purposes. No test maker can put into his test all desirable qualities.
A change in design improves the test in one respect only by sacrificing something
else. Some tests work well with children but not with adults; some
give precise measures but require a long time; some give satisfactory general
measures but are inferior for detailed diagnosis.

Tests must always be selected for the particular purpose for which they are
to be used. Even in similar situations the same tests may not be appropriate.
Readiness of a child for first grade must be measured by different tests, depending
on the instructional plan. Tests which select supervisors well in one
plant prove valueless in another. And clinicians may have to choose different
tests for each patient. No list of “recommended tests” can eliminate the
necessity for carefully choosing tests to suit each situation.

The user of tests has constantly to evaluate new developments. New tests
are produced, new uses of tests are discovered, and new findings about old
tests are brought to light. Any list of superior tests, therefore, soon becomes
outdated. Of the nine psychological tests most used in clinics in 1935 only
two remained in the top nine in 1948. Only two of the newcomers to the list
were published later than 1935; the other five were available in 1935 but
their usefulness was overlooked at that time (Louttit and Browne, 1946).
Buros (1941, p. 11) made the following comment in selecting tests for discussion
in his yearbook:

The decision was made to include old as well as new tests. Reviews of 
old tests may prove effective in eliminating from use many tests which 
were among the best in their day but are now outmoded and inferior to 
recently constructed tests. On the other hand, such reviews may result 
in increasing the use of old tests and testing techniques which compare 
very favorably with tests being currently published. The sale of 
outmoded and ever-decreasingly valid tests persists far beyond the sale of 
textbooks published in the same years. 

Prominence and popularity are not necessarily signs of quality. In clinical 
psychology and counseling particularly, fads in testing flourish. As Schafer 
says (1954, p. 6), “Because of its rapid growth, a boom town excitement has 
characterized clinical psychology until very recently. News of a ‘good’ test, 
like news of striking oil, has brought a rush of diagnostic drillers from the 
old wells to the new and has quickly led to the formation of a new elite” of 
persons specializing in that test. Techniques rushed into application far in 
advance of adequate research include projective tests such as the Rorschach, 
formulas for detecting brain damage from intelligence tests, and 
questionnaires such as the Taylor anxiety scale. The last-named, indeed, is 
a set of questions which the author developed for use only in laboratory 
research on learning, but clinical investigators seized the scale for diagnostic 
use without any evidence that the scale was superior for that purpose. Many 
of the fads in testing wane quickly, but some invalid tests hold their 
“best-seller” status for a generation. 

The testing “industry” of today had informal, even casual, beginnings. A 
psychologist or physician wanted to observe some type of motor, intellectual, 
or emotional behavior and chose a stimulus or task which he thought 
gave a good opportunity for observation. As he mentioned his findings to 
others, they copied his technique in their own clinics and laboratories. Soon 
there was a small market for equipment (tachistoscopes for studying flash 
perception, blocks for tests like Kohs’, etc.). A few books were written 
between 1910 and 1915, each describing one investigator’s procedures, but 
there was no large-scale manufacture of tests. Test publication in the 
modern sense resulted from the great interest in clinical and educational testing 
after World War I, and particularly from the popularity of standardized 
group tests. The view then prevalent was that a test score could be 
interpreted adequately only by comparison to national norms based on thousands 
of cases. Users all over the country wished to purchase the same tests, 
packaged in a form which ensured uniform procedure, and accompanied by 
national norms. 

One of the forces creating the American testing industry of today is the 
decentralization of schools, clinics, and guidance services. Every school 
system is free to adopt tests or not, and to choose whichever ones it prefers. 
Each counseling agency can purchase a different set of tests, and sometimes 
each psychologist within the agency chooses tests for himself. This 
decentralization, combined with a demand for carefully developed instruments, 
provides a competitive market which encourages publication of tests in great 
number and great variety. With these tests available, an industrial 
psychologist rarely thinks it necessary to make up new tests for his own factory. 
Even a great national agency such as the Veterans Administration relies on 
published tests for its clinical program. 

In Europe, competitive publication is almost unknown. There is, on the 
one hand, a tendency for each clinical psychologist or each industrial 
psychologist to develop his own testing procedures, that is, to modify the 
methods his colleagues have used. School systems and guidance services, on 
the other hand, are generally under centralized national control. Each 
service therefore develops its own series of standardized tests, and the 
counselor or local school administrator has no choice. A certain amount of 
test publication is now beginning in Europe, the stock consisting largely of 
translated American and British tests. There are also books by clinicians 
describing diagnostic procedures. Procedures are not fully standardized, 
and little test research is published; this means that the person responsible 
for a testing program in Europe must take on even greater responsibility 
than his American counterpart. 

American test publication began in a small way. A psychologist who had 
prepared a test printed copies for general sale, perhaps through a firm 
selling apparatus to psychology laboratories. As the demand for tests grew, 
particularly after World War I, some textbook publishers began to handle tests, 
and some firms specializing in school tests were established. Until about 
1945, the typical test was developed by an author or team of authors who 
completed the test and then offered it to the publisher. The publisher gave 
some assistance in the final stages of research and in editing the test manual, 
but the main scientific responsibility was the author’s. 

In recent years, this situation has changed. Experience made clear that 
satisfactory tests require long periods of development, following the best 
technical research design. Publishers and consumers began to examine more 
critically the quality of test material and the technical information regarding 
the effectiveness of the test. Today, authors are often discouraged from 
releasing new tests on which research is inadequate, even though tests of 
similar quality would have been accepted by most publishers twenty years ago. 
Test construction has increasingly been taken over by the test publishers, 
who have added technical staffs for this purpose. 

In addition to tests on the open market, Americans are making increasing 
use of so-called “program” tests. These are tests developed to fit the needs of 
a particular large program. Examples are an aptitude test developed for 
medical-college admission, another for awarding college scholarships on a 
competitive basis, and a battery of aptitude and achievement measures 
used in schools throughout a state for guidance purposes. These programs 
often require “secure” tests whose questions can be kept secret from persons 
to be tested. Tests for the programs are sometimes developed by professional 
persons employed by the testing agency. More often, the tests are developed 
by the staff of a test publisher. Many of the tests are constructed to resemble 
published tests which have already been standardized and validated. 
Program tests should be developed as carefully as tests published to be used 
by the whole profession. Technical information about the quality of the tests 
is not readily available to the general student of testing, which sometimes 
interferes with the evaluation of research using program tests. 

Though there has been an increasing concentration of responsibility in the 
hands of persons well trained in test construction, even the newer tests have 
marked limitations. Some of these limitations result only from the fact that 
no one test can do everything, but some tests still are published without 
adequate research and refinement. Some, even popular ones, do not succeed 
in measuring what they were intended to measure, and some measure 
characteristics other than what their titles suggest. Furthermore, the author’s 
description of a test understandably advertises its favorable features. Even 
today, some test manuals seriously mislead the uncritical reader. One recent 
aptitude battery for vocational guidance was published with what seemed at 
first glance to be impressive evidence of validity, but nearly all the 
“evidence” consisted of validity coefficients for an entirely different set of tests 
used in military selection. The only connection between the two batteries 
was a vague resemblance in plan. Since some published tests are nearly 
worthless, and since others extremely useful for one purpose will not perform 
well in another situation, the user must be able to choose among tests 
intelligently. 

Ability to judge tests is important for many people who will never choose 
tests themselves. The business executive may turn his selection and 
promotion problems over to psychological consultants. The psychiatrist or 
juvenile-court judge may place full responsibility for choice and interpretation of 
tests on clinical psychologists. Nonetheless, such consumers of test results 
must know how tests are evaluated and be aware of the common weaknesses 
of tests. Some industrial consultants recommend testing programs which 
other psychologists regard as overelaborate or inadequately validated; the 
executive needs to know something about testing if he is to ask the right 
questions regarding their proposals. There is an understandable tendency 
for the clinical tester to become overenthusiastic about the procedures in 
which he is expert, and to make his recommendations too confidently. 

On the other hand, executives, psychiatrists, judges, and others who 
receive reports from psychologists frequently depart from the 
recommendations made. Such departures, insofar as they take into account facts not 
available to the tester, are necessary and justified. But giving great weight to 
supplementary impressions and little weight to objectively observed 
behavior spoils more decisions than it helps. If the user of test information 
knows how tests are validated, he can decide when his own impressions are 
substantial enough to be given comparable weight. 

1. “Improving a test in one way weakens it in another.” What advantage, and 
what disadvantage, comes from each of the following changes? 
a. Lengthening a test. 
b. Making it interesting to children. 
c. Making it more diagnostic of strong and weak points. 
d. Giving it as an individual test instead of as a group test. 

2. This is a letter received by a psychologist from an industrial personnel manager 
hiring office and factory workers. How would you answer it on the basis of the 
paragraphs above, knowing that the tests mentioned are representative of their 
type? 

“. . . Just now we are planning the use of the following tests: Otis intelligence 
and Minnesota Multiphasic Personality Inventory, and aptitude tests related to 
our openings, such as the Bennett test. Does this seem to be a well-balanced 
testing schedule for industry? Are there tests that you think preferable to these?” 

3. It has been suggested that the American Psychological Association set up a 
committee to award a Seal of Approval to all well-prepared tests. Discuss the 
advantages and disadvantages of such a system. Would this plan eliminate the 
need for critical judgment by users? 


The Test Manual 

The manual (sometimes supplemented by a technical handbook) is the 
principal source of information about the technical quality of a published 
test. The manual is sold with the tests and provides detailed directions, 
scoring procedures, and research findings. 

Manuals are not always as useful as they should be. Some manuals omit 
facts which users need to judge the test, or gloss over unfavorable evidence. 
Even a generally excellent manual may have some inadequate sections. 
Faults are particularly frequent in tests issued before 1945, because authors 
of older tests rarely prepared complete manuals or brought their manuals 
up-to-date. 

Preparing a good manual is difficult. The more research there is on the 
test, the harder it is to summarize properly into a manual. The manual must 
be clear enough so that any qualified user can comprehend it, and so that 
the reader who is not qualified will realize that he is not. Yet the material 
must be precise enough to satisfy specialists in test research. 

The Technical Recommendations. A major aid in the preparation and use of 
test manuals is the Technical Recommendations published in 1954. 
Committees of national organizations interested in measurement studied the 
problem of improving information about tests and prepared a lengthy set of 
Technical Recommendations for Psychological Tests and Diagnostic 
Techniques (1954). A supplementary statement dealing with problems of 
achievement testing was also prepared (Technical Recommendations, 
1955). 

The Technical Recommendations indicate what the manual should 
contain. Many of the recommendations are accompanied by examples 
illustrating good or poor procedure. Figure 17 gives an extract from the Technical 
Recommendations to illustrate their form and content. This chapter and 
the next, on judging the quality of tests, discuss the aspects of tests with 
which the Technical Recommendations are concerned. 

The recommendations are used in several ways. Authors use them as a 
guide in writing manuals, and publishers use them in deciding when a test 
is ready for release. The recommendations draw the attention of test 
purchasers to points to consider in evaluating a test. 

Test Reviews. The trend toward improved test construction and manuals 
was accelerated by the work of Professor O. K. Buros, who began to release 
critical reviews of tests in 1934. These critical listings now take the form of 
Mental Measurements Yearbooks, the most recent of which appeared in 
1950, 1953, and 1959. 

Nearly all tests currently on the market, and some program tests, are 
reviewed in the Buros series. Each test is examined by two or more specialists 
chosen because of their practical experience and technical knowledge. 
Reviewers discuss what each test may best be used for, and draw attention to 
any questionable claims made in the test manual. Test reviews may also be 
found in several journals, particularly Educational and Psychological 
Measurement and Journal of Consulting Psychology. Although these reviews are 
an aid to the purchaser of tests, he must still judge tests for himself. He will 
find that reviewers sometimes disagree in judging a test, particularly when 
they approach it from different points of view. Sometimes a reviewer 
gives much attention to rather petty faults, and the reader must weigh these 
criticisms against the merits of the test. On the other hand, reviewers fail 
occasionally to notice faults. Even with a well-balanced review of a particular 
test, the final decision to use it or not to use it depends on the specific 
situation, which only the prospective user knows. 

We have already discussed some of the qualities which make a test suitable 
or unsuitable. In Chapter 1 attention was drawn to the necessity of 
selecting tests which the user is competent to give and interpret. Chapters 3 
and 4 introduced other considerations, including clarity of directions, 
freedom from coachability, convenience of scoring, objectivity, and adequacy of 
norms. All these are important because they affect directly or indirectly the 
power of the test to improve decisions. 


F. Scales and Norms 

F 1. Scales used for reporting scores should be such as to increase 
the likelihood of accurate interpretation and emphasis by test interpreter 
and subject. ESSENTIAL 

[Comment: Scales in which test scores are reported are extremely varied. Raw 
scores are used. Relative scores are used. Scales purporting to represent equal 
intervals with respect to some external dimension (such as age) are used. And so 
on. It is unwise to discourage the development of new scaling methods by 
insisting on one form of reporting. On the other hand, many different systems are 
now used which have no logical advantage, one over the other. 
Recommendations below that the number of systems now used be reduced to a few with 
which testers can become familiar, are not intended to discourage the use of 
unique scales for special problems. Suggestions as to preferable scales for 
general reporting are not intended to restrict use of other scales in research 
studies.] 

F 2. Where there is no compelling advantage to be obtained by reporting 
scores in some other form, the manual should suggest reporting scores in 
terms of percentile equivalents or standard scores. VERY DESIRABLE 

[Comment: Professional opinion is divided on the question whether mental 
test scores should be reported in terms of some theoretical growth scale, such 
as the intelligence quotient or the Heinis index. Thus, a test developer who 
has rationale for such scales as these should use them if he regards them as 
especially adequate. 

On the other hand, there is no theoretical justification for scoring mental 
tests in terms of an “IQ” which is not derived in terms of the theory underlying 
the Binet IQ and which has different statistical properties than the IQ does. 
Standard or percentile scores would be preferable to arbitrarily defined IQ 
scales such as are used in the Otis Gamma and Wechsler-Bellevue tests. 

Strong recommends that Vocational Interest Blank scores be converted into 
letter grades where “A” indicates that at least two-thirds of the criterion group 
equaled or exceeded a given score, etc. He bases this recommendation on the 
ground that finer score discriminations would lead only to unwarranted 
attempts at finer interpretative discrimination.] 

F 2.1. If grade norms are provided, tables for converting scores to 
percentiles (or standard scores) within each grade should also be provided. 
ESSENTIAL 

[Comment: At the high school level, norms within courses (e.g., second-year 
Spanish) may be more appropriate than norms within grades.] 

F 3. Standard scores obtained by transforming scores so that they have 
a normal distribution and a fixed mean and standard deviation should 
in general be used in preference to other derived scores. For some tests, 
there may be a substantial reason to choose some other type of derived 
score. VERY DESIRABLE 


FIG. 17. A section from the Technical Recommendations (1954). 
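
Recommendations F 2 and F 3 in Figure 17 mention percentile equivalents and 
standard scores transformed to a normal distribution with a fixed mean and 
standard deviation. The following is a minimal sketch of one common way to 
compute both from a norm group. The midpoint convention for tied scores, the 
mean of 50, and the standard deviation of 10 are illustrative assumptions, not 
values given in the Technical Recommendations; the function names are 
hypothetical. 

```python
from statistics import NormalDist


def percentile_ranks(raw_scores):
    """Percentile rank of each score within the group.

    Assumed convention: percent of cases below the score plus half the
    percent of cases tied with it.
    """
    n = len(raw_scores)
    ranks = []
    for x in raw_scores:
        below = sum(1 for y in raw_scores if y < x)
        tied = sum(1 for y in raw_scores if y == x)
        ranks.append(100.0 * (below + 0.5 * tied) / n)
    return ranks


def normalized_standard_scores(raw_scores, mean=50.0, sd=10.0):
    """Normalized standard scores: pass each percentile rank through the
    inverse of the unit normal curve, then rescale to the chosen mean and
    standard deviation (50 and 10 here are illustrative only)."""
    unit_normal = NormalDist()
    return [
        mean + sd * unit_normal.inv_cdf(p / 100.0)
        for p in percentile_ranks(raw_scores)
    ]


if __name__ == "__main__":
    norm_group = [12, 15, 15, 18, 25]  # hypothetical raw scores
    print(percentile_ranks(norm_group))
    print([round(s, 1) for s in normalized_standard_scores(norm_group)])
```

Because the conversion depends only on rank within the norm group, the 
resulting scores are only as meaningful as the norm group is appropriate. 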

The quality which most affects the value of the test, however, is its 
validity. Validity is high if a test measures the right thing, i.e., if it gives the 
information the decision maker needs. No matter how satisfactory it is in other 
respects, a test which measures the wrong thing is worthless. We shall devote 
the remainder of this chapter to validity. Other relatively less important 
factors in choosing a test will be treated in the next chapter, after which we 
shall consider as a whole the problem of choosing a test. 

TYPES OF VALIDITY 

A test which helps in making one decision may have no value at all for 
another. This means that we cannot ask the general question “Is this a valid 
test?” The question to ask is “How valid is this test for the decision I wish to 
make?” or more generally, “For what decisions is this test valid?” 

Very often, especially in selection or classification, the decision is based on 
a person’s expected future performance as predicted from the test score. If 
these expectations are confirmed, the test has given highly useful 
information, but if the predictions do not correspond to what happens later, the test 
was worthless. To know how validly the test predicts, a follow-up study is 
required. 

In selection or classification, the psychologist wants to maximize some 
outcome: job success, amount learned, obedience to law, etc. He gives a test, 
makes his predictions, tries the treatment suggested by those predictions, 
and waits to see what happens. He obtains a record of the outcome 
(foreman’s rating, school grade, or number of court appearances, for example). 
This record, which we speak of as a criterion, he compares to the prediction. 
This is a straightforward empirical¹ check on the value of the test. The 
psychologist has determined what we call its predictive validity. 
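
The comparison of test scores with a later criterion is commonly summarized 
by a correlation coefficient (the “validity coefficients” mentioned earlier 
in this chapter). The sketch below computes a product-moment correlation 
from first principles; the function name and the sample figures are 
hypothetical and serve only to illustrate the arithmetic. 

```python
from math import sqrt


def validity_coefficient(test_scores, criterion):
    """Product-moment correlation between test scores recorded at the time
    of testing and a criterion measure collected later."""
    n = len(test_scores)
    if n != len(criterion) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 cases")
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion))
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion)
    if sxx == 0 or syy == 0:
        raise ValueError("scores or criterion do not vary")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical data: test scores at hiring, foreman's ratings months later.
    scores = [31, 42, 48, 55, 60, 67]
    ratings = [2, 3, 3, 4, 3, 5]
    print(round(validity_coefficient(scores, ratings), 2))
```

A coefficient near zero would suggest the test adds little to the decision in 
this situation; how large a value is useful depends on the use, as the later 
discussion of the criterion makes clear. 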

In many situations for which tests are developed, some more cumbersome 
method of collecting information is already in use. If the existing method is 
considered useful for decision making, the first question in validation is 
whether the new test agrees with the present source of information. If they 
disagree, the test may have value of its own, but it is certainly not a 
substitute for the original method. Validation again requires an empirical 
comparison. Both the test and the original procedure are applied to the same 
subjects, and the results are compared. For example, tests intended for clinical 
diagnosis are compared with the judgments made by a psychiatrist who 
interviews each patient. A test of proficiency in radar maintenance may be 
compared with ratings given by an instructor who watches each man in the 
shop. This type of empirical check on agreement is called concurrent 
validation, because the two sources of information are obtained at very nearly 
the same time (Figure 18). 

¹ An empirical method involves collection and analysis of data. It is contrasted with 
purely logical methods of arriving at conclusions. 

FIG. 18. Predictive and concurrent validation compared. 

When tests are used to evaluate educational or therapeutic programs, a 
different kind of validation may be needed. The program is trying to produce 
a certain change in behavior, and therefore, to evaluate the effectiveness of 
the program, the tester needs to measure just that type of behavior. If a 
course is supposed to teach American geography, it would not be fair to 
measure its effectiveness by a test on the geography of New England. 
The tester interested in evaluation needs to ask, “Does this test represent the 
content or activities I am trying to measure?” Instead of comparing scores 
on the test with some other measure or judgment, as in empirical validation, 
he must examine the items themselves and compare them with the content 
he wishes to include. This process is called content validation. Thus the 
content validity of the geography test would have to be studied by checking 
the items against the course of study the students have followed. 
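
Checking items against a course of study is a matter of judgment, but the 
bookkeeping can be sketched simply. The example below assumes that each 
test item has already been classified by topic and that the course of study 
is available as a list of topics; the function name, the topic labels, and 
the counts are all hypothetical. 

```python
from collections import Counter


def content_coverage(item_topics, course_topics):
    """Tally how test items are distributed over the topics of a course.

    Returns (share of items per course topic, course topics with no items,
    item topics that are not part of the course).
    """
    counts = Counter(item_topics)
    total = len(item_topics)
    share = {topic: counts.get(topic, 0) / total for topic in course_topics}
    untested = [t for t in course_topics if counts.get(t, 0) == 0]
    off_course = sorted(set(item_topics) - set(course_topics))
    return share, untested, off_course


if __name__ == "__main__":
    # Hypothetical course on American geography and a test that samples
    # only one region of it.
    course = ["New England", "South", "Midwest", "West"]
    items = ["New England"] * 4
    share, untested, off_course = content_coverage(items, course)
    print(share)
    print("Course topics with no items:", untested)
    print("Item topics outside the course:", off_course)
```

A tally of this kind only flags gaps; deciding whether each item really 
represents the intended content still requires examining the items themselves. 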

The aforementioned types of validity are examined when a test is 
intended for a specific practical use. Sometimes, however, the test is used to 
arrive at a description of the individual which will be used for many 
purposes, or the test may measure outcomes for scientific rather than 
immediately practical purposes. In these applications, the test results are likely to be 
translated into general psychological terms. Instead of reporting that an 
experimental treatment “has increased the score on the Jones test,” the 
psychologist wants to make the broader interpretation that “anxiety” has 
increased. The concept anxiety is part of a psychological theory which tells 
what behavior to expect from a person with great anxiety, under various 
conditions. Whenever a tester asks what a score means psychologically or 
what causes a person to get a certain test score, he is asking what concepts 
may properly be used to interpret the test performance. This type of 
theoretical concept is called a construct, and the process of validating such an 
interpretation is called construct validation. In order to show that a given 
construct applies to a test, it is necessary to derive hypotheses about test 
behavior from the theory related to the construct and to verify them 
experimentally. The theory of “anxiety” accepted by the tester might include such 
expectations as the following: if persons are exposed to a threat of electric 
shock, their anxiety will increase; neurotics are more anxious than 
nonneurotics; anxiety is lowered by administration of a certain drug; anxious 
persons have a high level of aspiration. Each of these expectations can be tested 
by an experiment or a statistical study of group differences. Determining 
construct validity is much more complex than the other types of 
validation, as our later discussion will show. 

Table 8 summarizes the statements made to this point. 

With so many different ways to examine validity, each one applying to a 
particular use of the test, it is apparent that no test developer can validate 
his test exhaustively. The test user cannot expect the manual to provide 
complete evidence on validity, yet he does not wish to use a test whose validity 
is uncertain. What can we legitimately demand of the test developer? The 
Technical Recommendations indicate that he must assume the burden of 
proof whenever he recommends the test for a certain use. “The manual 
should report the validity of each type of inference for which the test is 
recommended.” Most tests have a few principal uses for which their validity 
has been thoroughly studied, and this research answers the questions of 
most test users. The user who wishes to apply the test in any other way may 
have to make his own validity studies. 

No matter how complete the test author’s research, the person who is 
developing a selection or classification program must, in the end, confirm for 
himself the validity of the tests in his particular situation. And the person 
who is evaluating a training program must determine the content validity of 
the tests for this program. In this chapter we shall concentrate on 
understanding the material presented in test manuals. Later (Chapter 12), it will 
be necessary to examine how the tester can conduct validation studies in 
his own situation. 

The Bennett TMC manual (Form AA) is fairly typical, though less 
extensive than many recent manuals. It summarizes studies of predictive and 
concurrent validity made in industry and military training. This information 
indicates that the test has considerable predictive value for mechanical trades 
and engineering. In the manual there is no information on the test as a 
predictor of school and college grades, a serious omission. Concurrent 
correlations of the TMC with several intelligence tests and with other mechanical 
aptitude tests are reported. This feature informs the tester about the 
possibility of substituting the TMC for one of the other tests, and also aids in 
interpreting the construct “mechanical comprehension.” Content validation is not 
required for the TMC. 

How well does this information fill the counselor’s needs? Counselors need 
to advise students regarding many questions of vocational specialization, yet 
only scattered validity studies are available. Indeed it would be impossible 
to conduct separate validations for all the vocations counselees will wish to 
consider. The list would have to cover architecture, aeronautics and 
hydraulics, metalworking and woodworking, design, construction, maintenance, 
and so on ad infinitum. Even where a predictive study has been made for a 
specific occupation, one must recognize that not all jobs within the 
occupation make the same demands. The counselor therefore cannot hope to make 
definite predictions. 

The counselor can interpret the score only by knowing what “mechanical 
comprehension” signifies, as here measured. How much does it depend on 
specific training? This can be determined by learning how much scores 
increase during a shop course or a physics course. Does it apply solely to 
mechanical-manipulative occupations, or to all work that requires reasoning 
about forces and motion? This is answered by integrating the available 
prediction studies. Are individual differences stable enough to justify long-range 
predictions? This calls for a long-term follow-up. Does mechanical 
comprehension promise skill in handling tools and machines? The answer comes 
from the comparison of TMC scores to scores on apparatus tests. 

The Bennett manual does not include all this information. Older tests, in 
general, were published without comprehensive validation, and even the 
best manual must leave some questions unanswered. The modern DAT 
manual, after 35 pages of validity data, concludes with a statement urging 
the counselor to prepare expectancy tables for courses in his own school and 
for jobs in his own community. The test constructor is not expected to 
answer every last question about validity before publishing his test, but he is 
expected to give the test user a fair impression of its validity. 
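
An expectancy table of the kind the DAT manual urges counselors to prepare 
can be sketched as follows: for each band of test scores, count the cases 
and report the percentage that later met the criterion. The score bands, 
the pass/fail criterion, and the data below are hypothetical and are not 
taken from any manual. 

```python
def expectancy_table(scores, succeeded, bands):
    """Build a simple expectancy table.

    scores    -- test scores, one per person
    succeeded -- True/False, whether each person later met the criterion
    bands     -- list of (low, high) score bands, inclusive at both ends

    Returns one row per band: ((low, high), number of cases, percent
    succeeding). The percent is None when a band contains no cases.
    """
    rows = []
    for low, high in bands:
        outcomes = [ok for s, ok in zip(scores, succeeded) if low <= s <= high]
        n = len(outcomes)
        pct = 100.0 * sum(outcomes) / n if n else None
        rows.append(((low, high), n, pct))
    return rows


if __name__ == "__main__":
    # Hypothetical local data: test score and whether the student passed
    # a shop course.
    scores = [10, 20, 30, 40, 50, 60]
    passed = [False, False, True, False, True, True]
    for band, n, pct in expectancy_table(scores, passed,
                                         [(0, 29), (30, 49), (50, 99)]):
        print(band, n, pct)
```

With so few cases per band the percentages are unstable; a local table is 
informative only when each band holds a reasonable number of persons. 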

4. Would predictive or concurrent validity be studied in these situations? 

a. The U.S. Employment Service wishes to test men to determine which ones 
have had enough experience to be referred to contractors who have 
vacancies for electricians. 

b. A medical school wishes to test the personalities of its applicants to 
determine which ones are best suited to a physician’s responsibilities. 

c. A pencil-paper test is used to identify students entering junior high school 
who have emotional difficulties and should be singled out for counseling. 

5. A typing test which has excellent content validity for the original user may have 
poor content validity for some other user. Illustrate this statement. 
6. Why would it be valuable to find out “what a test of pharmacy aptitude 
measures,” if we already know that it predicts success in pharmacy school? 

PREDICTIVE AND CONCURRENT VALIDITY 
The Criterion 

An investigator studies predictive validity when his primary interest is in 
some outcome. The outcome is what we want to improve by our professional 
decisions: it is the employee’s production on a job, the patient’s response to 
therapy, the counselee’s satisfaction with his life after counseling. The 
criterion is a record of the outcome. 

For example, suppose a wholesale hardware concern wants to hire good 
salesmen and is trying a test to predict sales ability. The outcome that 
interests the firm is the sales each man will make. For research, this outcome 
has to be expressed in terms of some definite index of success. Perhaps 
“amount sold in six months” will serve as a criterion measure. This result 
must be compared to test scores recorded before the men were hired, to 
learn how much predictive validity the test has. If the test is unrelated to 
the criterion, it is invalid for selecting salesmen for this firm. A single 
predictive study does little to clarify what psychological factors are represented 
in a test, but it does establish the test’s usefulness and limitations for one 
practical situation. 

The greatest difficulty in empirical studies is to obtain a suitable criterion 
measure. If the index does not really represent “selling success,” the test has 
not been given a fair trial. Let us look at the weaknesses of the criterion 
suggested for validating the salesmanship test. In the first place, it represents 
only the wholesale hardware business, so that at best we can judge the test 
for only this one use; additional predictive studies will be required if the test 
is considered for hiring men to sell insurance or machine tools. Although 
“amount sold” appears to be a fair basis for judging success, some men were 
assigned more desirable territory than others, so that sales do not reflect 
ability alone. Suppose we control this by comparing each man’s sales with normal 
sales in his territory. We still have not considered the possible effect on 
business of variable factors, such as poor crops in one region. Still another 
problem is that sales alone may not be what we desire from a salesman. 
A high-pressure salesman may build up high total sales on a first trip but, 
by overselling, create problems which will eventually harm the firm’s 
business. 

A common type of criterion is the rating or grade. Aptitude tests are 
validated against marks earned in school. Industrial predictors are validated 
against ratings by supervisors. These ratings are rather poor criteria because 
the judge often does not know the facts about the person and because judges 
disagree. When a test fails to predict a rating, it is hard to say whether this 
is the fault of the test or of the rating. 

Concurrent validity is investigated when the test is proposed as a 
substitute for some other information; this information is then the criterion. 
Designers of new tests frequently establish concurrent validity for their 
instruments by comparing them to established tests. New tests of intelligence, for 
example, have frequently been correlated with the Stanford-Binet 
intelligence test, whose predictive and construct validity have been studied 
extensively. A test which agrees with the Binet test measures "whatever the 
Binet test measures" and may be relied upon for the same purposes. This 
procedure is helpful only if the test used as criterion is meaningful and 
important. There is little value in knowing that three questionnaires of 
"neurotic tendency" agree if none of the tests measures anything save ability to 
see through the test and give "desirable" answers. Likewise, a psychologist 
who distrusts psychiatric diagnoses would be hesitant to use them as a 
criterion for a personality test. 

7. Criticize each of the following criteria: 

a. Ratings of student teachers by their supervisors, as an index of teaching 
ability. 

b. Number of accidents a driver has per year, as an index of driver safety. 

c. Number of accidents a driver has per thousand miles, as an index of driver 
safety. 

8. A test of preschool children is validated in three ways. (1) Intelligence is 
defined as ability to learn responses with which one has had no previous 
experience. The test items are examined and found to fit this definition. 
(2) Scores on the test, given at age 3, are found to be related to reading skill 
and vocabulary knowledge at the end of the first grade. (3) Scores on the 
test, given at age 3, are found to be related to scores on the Stanford-Binet 
test given at age 16. 

a. What possible uses of the test are warranted, on the basis of each of these 
studies? 

b. Would it be possible for a test to show high validity by method (2) and to 
lack validity according to the other two procedures? 

9. A study-habits inventory asks such questions as "Do you daydream when 
you should be studying?" 

a. What criterion would you use to determine empirically whether the 
inventory really measures study habits? 

b. What criterion would you use to determine whether the inventory predicts 
success in college? 

c. Which study would be best to show that the test is valid? 

10. Criticize the procedure indicated in the following report of a study of success of 
teachers college students (cited by Eckelberry, 1947). 

"The correlation between all thirty of the [predictor] variables and the 
[school] superintendents' ratings was only .17, but that between the variables 
and marks earned during four years of college was .79. Since college marks 
were predictable on the basis of the thirty variables and ... the 
superintendents' ratings were not, the marks were substituted for the ratings as a criterion 
of success." 

Correlation Coefficients 

A study of predictive or concurrent validity is nearly always reported in 
terms of a correlation coefficient. This is a statistical summary of the 
relationship between two variables, and it plays a fundamental part in test research. 
It is the most common method for reporting the answer to such questions as 
the following: Does this test predict performance on the job? Do these two 
tests measure the same thing? Do scores people made on this test a year ago 
agree with the scores they make now? 

To illustrate correlation, let us consider ten hardware salesmen who were 
given three tests when hired. After six months, when the criterion records 
are in, we have the information in the left portion of Table 9. The problem is 


TABLE 9. Data on Ten Hardware Salesmen 

                  Test Scores          Criterion   Criterion       Test Rank
Salesman   Test 1   Test 2   Test 3     Measure       Rank       1     2     3
A            30       45       34       $25,000         6        4     7     7
B            34       64       35        38,000         2        2     3     5½
C            32       32       35        30,000         4        3     9     5½
D            47       52       31        40,000         1        1     5     9
E            20       74       36         7,000        10        9     1     4
F            24       50       40        10,000         9        7     6     1
G            27       53       37        22,000         7        5     4     3
H            25       36       30        35,000         3        6     8    10
I            22       71       32        28,000         5        8     2     8
J            16       28       39        12,000         8       10    10     2

to judge which test is the best predictor. The test scores are hard to 
examine in "raw" form, since each test has a different average. 

One way to simplify the data is to change them to ranks, as in the right 
portion of Table 9. (Note that when two men tie on test 3, we give them 
the rank halfway between the positions which the pair occupies.) Now we 
see that E, poorest on the criterion, has very low rank on test 1, high on 2, 
average on 3. Man F, also poor as a salesman, is below the median on 1 and 
2, but at the top in 3. Before reading ahead, study Table 9 to decide how 
valid each test is for selecting hardware salesmen. 

Rank Correlation. To obtain a single estimate of the goodness of each test, 
we compute a correlation. A simple procedure, useful for studies involving 
few cases, is the rank-difference correlation. (Below, we shall show the 
product-moment technique, the more complicated computation that is 
most used in test research.) The symbol $\rho$ (the Greek letter rho) is used for 
a rank-difference correlation coefficient. In Computing Guide 4, we show the 
steps in determining $\rho_{1c}$, comparing test 1 with the criterion c. 

When the computations for all three tests are performed, we have these 
correlations between test and criterion: 

$\rho_{1c} = .782$ 
$\rho_{2c} = -.090$ 
$\rho_{3c} = -.754$ 

A positive coefficient shows that high standing on the test goes with high 
standing on the criterion. A negative coefficient shows that high standing on 
the test goes with low standing on the criterion. 


COMPUTING GUIDE 4. RANK-DIFFERENCE CORRELATION 

1. Begin with the pairs of scores to be studied. 
   Example: Man A (x = 30, c = 25,000). 

2. Rank men from 1 to N (number of men) in each set of scores. (Note that 
   the lowest man must have rank N, unless he ties with someone.) 
   Example: Man A has ranks 4, 6; N = 10. 

3. Subtract the rank in the right-hand column from the one in the left-hand 
   column. This gives the difference D. (As a check, make sure that this 
   column adds to zero.) 
   Example: Man A: 4 - 6 = -2. 

4. Square each difference to get $D^2$. 
   Example: Man A: $(-2)^2 = -2 \times -2 = 4$. 

5. Sum this column to get $\Sigma D^2$. 

6. Apply the formula 

   $$\rho = 1 - \frac{6(\Sigma D^2)}{N(N^2 - 1)}$$ 

   Example: 

   $$\rho = 1 - \frac{6(36)}{10(100 - 1)} = 1 - \frac{216}{990} = 1 - .218 = .782$$ 

             Scores                   Ranks
Man    Test     Criterion (c)    Test    Criterion    Difference (D)    Squared Difference ($D^2$)
A       30        $25,000          4         6             -2                     4
B       34         38,000          2         2              0                     0
C       32         30,000          3         4             -1                     1
D       47         40,000          1         1              0                     0
E       20          7,000          9        10             -1                     1
F       24         10,000          7         9             -2                     4
G       27         22,000          5         7             -2                     4
H       25         35,000          6         3              3                     9
I       22         28,000          8         5              3                     9
J       16         12,000         10         8              2                     4

                                                      $\Sigma D = 0$          $\Sigma D^2 = 36$
A zero coefficient means that one cannot predict the criterion from the test. 
A correlation of 1.00 or -1.00 shows perfect relationship; when this occurs, 
the criterion score (or rank) can be predicted exactly. Test 1 identified the 
best salesman and the second-best accurately, but the third-best salesman 
ranked sixth on the test, which lowered the correlation. The larger the 
correlation, whether positive or negative, the more accurate the prediction. On 
test 3, a low score picks out a superior salesman. From these data we 
conclude that either test 1 or test 3 is a good predictor for this firm. 
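
The steps of Computing Guide 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original guide: the function names `average_ranks` and `rank_difference_rho` are hypothetical, and ties are handled by the halfway rule described for test 3. The scores are those of Table 9, and the results should agree with the three coefficients quoted in the text to within rounding.

```python
def average_ranks(scores):
    """Rank from 1 (highest score) to N; tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next man has the same score (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end + 2) / 2  # halfway between positions pos+1 and end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_difference_rho(test_scores, criterion):
    """rho = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), following Computing Guide 4."""
    n = len(test_scores)
    test_ranks = average_ranks(test_scores)
    criterion_ranks = average_ranks(criterion)
    sum_d2 = sum((t - c) ** 2 for t, c in zip(test_ranks, criterion_ranks))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Table 9, salesmen A through J.
tests = {
    "test 1": [30, 34, 32, 47, 20, 24, 27, 25, 22, 16],
    "test 2": [45, 64, 32, 52, 74, 50, 53, 36, 71, 28],
    "test 3": [34, 35, 35, 31, 36, 40, 37, 30, 32, 39],
}
sales = [25000, 38000, 30000, 40000, 7000, 10000, 22000, 35000, 28000, 12000]

rho = {name: rank_difference_rho(scores, sales) for name, scores in tests.items()}
for name, value in rho.items():
    print(f"{name}: rho = {value:.3f}")
```

Ranking inside the function, rather than typing the ranks by hand, makes it easy to rerun the computation when a man's score changes.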

11. Compute $\rho_{2c}$ and $\rho_{3c}$. 

12. Obtain a combined score by subtracting score 3 from score 1 for each man. 
Correlate this with the criterion. Would it improve prediction to use both tests? 

Product-Moment Correlation. Although harder to learn, the product-moment 
technique for computing correlation is easier to apply to large groups than 
the rank method. The rank formula is equivalent to computing the 
product-moment correlation between ranks. 
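
The equivalence just stated can be checked numerically. The sketch below is an illustration under stated assumptions: `pearson_r` and `rho_from_ranks` are hypothetical helper names, and the ranks are those of test 1 and the criterion in Table 9, where no ties occur. When ties are present the two computations can differ slightly, so the check is made only on untied ranks.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rho_from_ranks(rank_x, rank_y):
    """Rank-difference formula applied to ranks that are already assigned."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Ranks of salesmen A through J from Table 9 (test 1 and criterion, no ties).
test1_ranks = [4, 2, 3, 1, 9, 7, 5, 6, 8, 10]
criterion_ranks = [6, 2, 4, 1, 10, 9, 7, 3, 5, 8]

r_on_ranks = pearson_r(test1_ranks, criterion_ranks)
rho = rho_from_ranks(test1_ranks, criterion_ranks)
print(f"product-moment r on ranks = {r_on_ranks:.3f}, rank-difference rho = {rho:.3f}")
```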

A product-moment correlation (r) may be determined from a "scatter 
diagram" obtained by plotting pairs of scores. The scores of Table 9 can be 
put in this form. We set up a chart, with the first variable (test score) along 
the horizontal axis and the second variable (sales) along the other (see 
Figure 19). Man A is plotted above 30 on the x-axis (test 1), and opposite 
$25,000 on the y-axis (criterion). We can observe how criterion scores correspond 
to test scores. As score 1 rises, c tends to rise. 

FIG. 19. Scatter diagram for test 1 and criterion c. 
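
For readers who prefer arithmetic to a chart, the product-moment correlation for the same ten pairs can be computed directly from the raw scores. This is a sketch using the usual deviation-score definition of r, not the scatter-diagram procedure of the chapter's computing guide; `pearson_r` is a hypothetical helper name. For these data the result comes to roughly .75, close to the rank-difference value of .782.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation: sum of cross-products of deviations,
    divided by the square root of the product of the sums of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Table 9: test 1 scores and six-month sales for salesmen A through J.
test1 = [30, 34, 32, 47, 20, 24, 27, 25, 22, 16]
sales = [25000, 38000, 30000, 40000, 7000, 10000, 22000, 35000, 28000, 12000]

r_1c = pearson_r(test1, sales)
print(f"r between test 1 and criterion = {r_1c:.3f}")
```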

At the end of the chapter is a computing guide showing how one 
obtains a product-moment correlation from the scatter diagram. It is not 
necessary to learn this procedure in order to interpret coefficients. 

Any statistic has a certain variation from one sample to another. Even if 
groups of subjects are drawn at random from the same population, the 
correlation coefficients between two variables will differ from sample to sample. 
Using a large sample of course makes the correlation more dependable. The 
fluctuation of correlations from sample to sample may be considerable. If the 
correlation of two scores in a large population is .30, in ten random samples 
of 100 cases each the product-moment correlations would vary thus: .17, 
.47, .34, .31, .24, .39, .20, .25, .28, .45. If samples are not random, but come 
from different firms or communities, the fluctuation of coefficients will be 
even greater. 
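
The fluctuation described above can be reproduced by simulation. The following sketch rests on assumptions not in the text: scores are drawn from a bivariate normal population with correlation .30, the seed is arbitrary, and the helper names are hypothetical. The particular coefficients printed will not match the ten values quoted, but their spread should be of the same order.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def sample_correlation(n, population_r, rng):
    """Draw n cases from a population with the given correlation and return the sample r."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        error = rng.gauss(0, 1)
        xs.append(x)
        # y shares a fraction population_r of x; the remainder is independent.
        ys.append(population_r * x + sqrt(1 - population_r ** 2) * error)
    return pearson_r(xs, ys)


rng = random.Random(0)  # arbitrary seed, used only to make the run repeatable
sample_rs = [sample_correlation(100, 0.30, rng) for _ in range(200)]

print("first ten samples:", [round(r, 2) for r in sample_rs[:10]])
print(f"lowest {min(sample_rs):.2f}, highest {max(sample_rs):.2f}, "
      f"mean {sum(sample_rs) / len(sample_rs):.2f}")
```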

13. Prepare scatter diagrams relating tests 2 and 3 to the criterion. 

14. How much would the rank correlation in Computing Guide 4 change if person 
J had been replaced by a person with score 21 and criterion $26,000? 

Meaning of Correlations. How well one variable predicts another is shown 
by the scatter diagram. Figure 20 shows scatter diagrams corresponding to 
various sizes of coefficient. When r = 1.00, one variable is predicted perfectly 
from the other. With r = .60, prediction is only approximate. People who 
stand at 8 on X average near 7 on Y, but they spread from 3 to 9. An 
employer wishing not to lose any applicant whose Y score is 8 or better would 
have to hire everyone with an X score of 4 or better. Prediction becomes 
progressively poorer as the scatter diagram becomes "fatter." 

Another way of considering the meaning of correlation is to translate the 
scatter diagram into an expectancy chart. The expectancy charts shown in 
Figure 10 (p. 73) correspond to test-criterion correlations of .51 for 
mechanical aptitude, .47 for trade information, and .26 for the nut-and-bolt test. 

When the correlation is less than 1.00, one measure is influenced by some 
factor not found in the other measure. Random errors of measurement lower 
correlation. So do causal factors not involved equally in both variables. For 
example, the correlation between intelligence and school marks is only 
moderate because many factors besides mental ability influence the marks: pupil 
effort, teacher bias, previous school learning, health, and so on. 
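
The lowering of a correlation by random errors of measurement can be illustrated with made-up data. Everything in this sketch is an assumption for illustration: the variable names, the sizes of the error terms, and the seed are hypothetical and describe no real test.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)  # arbitrary seed
n_cases = 5000

ability = [rng.gauss(0, 1) for _ in range(n_cases)]
# Marks depend on ability plus other influences (effort, health, and so on).
marks = [a + rng.gauss(0, 0.5) for a in ability]
# A test score that measures ability with added random error.
test_score = [a + rng.gauss(0, 1.0) for a in ability]

r_without_error = pearson_r(ability, marks)
r_with_error = pearson_r(test_score, marks)
print(f"ability with marks:          r = {r_without_error:.2f}")
print(f"error-laden test with marks: r = {r_with_error:.2f}")
```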

It is incorrect to interpret high correlation as showing that one variable 
"causes" the other. There are at least three possible explanations for a high 
correlation between variables A and B. A may cause or influence the size of 
B, B may cause A, or both A and B may be influenced by some common 
factor or factors. The correlation between vocabulary and reading may be 
taken as an example. Does good vocabulary cause one to be a good reader? 
Possibly. Or does ability to read well cause one to acquire a good vocabulary? 
An equally likely explanation. But to some extent both scores result from 
high intelligence, a home in which books and serious conversation abound, 
or superior teaching in the elementary schools. Only a theoretical 
understanding of the processes involved, or controlled experiments, permits us to 
state what causes underlie a particular correlation. Without this, the only 
safe conclusion is that correlated measures are influenced by a common 
factor. 

FIG. 20. Scatter diagrams yielding correlations of various sizes. 
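
The third explanation, a common factor, can be made concrete with a small simulation. This is a hypothetical sketch: the two measures are constructed so that neither influences the other, yet they correlate because both depend on the same underlying factor. The names, error sizes, and seed are assumptions made only for the illustration.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(2)  # arbitrary seed
n_cases = 10000

common = [rng.gauss(0, 1) for _ in range(n_cases)]      # e.g., home background
vocab_specific = [rng.gauss(0, 1) for _ in range(n_cases)]
reading_specific = [rng.gauss(0, 1) for _ in range(n_cases)]

# Each measure is the common factor plus its own independent part.
vocabulary = [c + s for c, s in zip(common, vocab_specific)]
reading = [c + s for c, s in zip(common, reading_specific)]

r_observed = pearson_r(vocabulary, reading)
# With the common factor removed, only the independent parts remain.
r_specific = pearson_r(vocab_specific, reading_specific)
print(f"vocabulary with reading:          r = {r_observed:.2f}")
print(f"after removing the common factor: r = {r_specific:.2f}")
```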

15. How large a correlation would you anticipate between the following pairs of 
variables? 

a. Age and annual income of men aged 20 to 50 

b. Age in January, 1930, and age in March, 1950 

c. Scores on two intelligence tests, given the same week 

d. Annual income and number of children, among married urban men 

e. Maximum and minimum temperature in Wichita, each day for a year 

16. What is the expectancy of earning above average on Y, if a person has a score 
of 8 on X? Determine this for each value of r in Figure 20. 

17. What possible causal relations might underlie each of the following 
correlations? 

a. Between amount of education and annual income of adults (assume that r 
is positive). 

b. Between average intelligence of children and size of family (assume that 
r is negative). 

c. Between Sunday-school attendance and honesty of behavior (assume that 
r is positive). 

18. Beginning with the information in Figure 20, prepare an "expectancy table" 
similar to Table 1, corresponding to each of the following values of r: 1.00, 
.90, .40, .20. 

Typical Validity Coefficients 

Correlations between test and criterion are called validity coefficients. 
Table 10 lists some fairly typical coefficients of predictive and concurrent 
validity, taken, in each case, from the test manual. Some test-criterion 
combinations yield much greater validity than others. The variation in results 
for the Short Employment Tests should be particularly noted in Table 10. It 
is very unusual for a validity coefficient to rise above .60, which is far from 
perfect prediction. 

Although we would like higher coefficients, any positive correlation 
indicates that predictions from the test will be more accurate than guesses. 
Whether a validity coefficient is high enough to warrant use of the test as a 
predictor depends on such practical considerations as the urgency of 
improved prediction, the cost of testing, and the cost and validity of the 
selection methods already in use. To the question "What is a good validity 
coefficient?" the only sensible answer is, "The best you can get." If a criterion can 
be predicted only with validity .20, the test may still make an appreciable 
practical contribution. Naturally, a greater contribution is required to justify 
an expensive, inconvenient procedure than an inexpensive one. 

In interpreting any correlation coefficient, the range of the group studied 


TABLE 10. Illustrative Validity Coefficients 

Test                        Sample                          Criterion                       Type of     Coefficient
                                                                                            Validity
California Short-Form Test  100 children referred to a      Wechsler Individual test        Concurrent  .77 for total score
  of Mental Maturity          guidance department
Gordon Personal Profile     122 college students            Ratings of personality by       Concurrent  .49 to .73 for four
                                                              dormitory mates                             subscores
Iowa Tests of Educational   634 students in six Iowa        Grade point averages as         Predictive  .58
  Development                 colleges tested in grade 9      college freshmen
Short Employment Tests      51 operators of proof machines  Supervisor's merit ratings of   Concurrent
  Verbal                      in a large bank                 job performance                           .15
  Number                                                                                                .25
  Clerical                                                                                              .37
Short Employment Tests      80 skilled operators of         Records of production on ten    Not stated
  Verbal                      bookkeeping machines in a       days                                      .10
  Number                      bank                                                                      .26
  Clerical                                                                                              .34
Short Employment Tests      262 students in a one-year      Satisfactory completion of      Predictive
  Verbal                      secretarial training course     course vs. noncompletion or               .15
  Number                                                      noncertification                          .48
  Clerical                                                                                              .47
Short Employment Tests      52 stenographers and clerks     Ratings of job performance      Predictive
  Verbal                      in an industrial concern                                                  .45
  Number                                                                                                .08
  Clerical                                                                                              .31


must be considered. The correlation is smaller in a select group than in a 
group containing a wide range of ability. High-school achievement would 
predict college marks with a validity much above .60 if all those with poor 
records went on to college. The validity of the Iowa Tests for advising pupils 
whether to plan on going to college is higher than their validity for selecting 
among the high-school graduates who apply for college admission, because 
the latter group is already restricted. (For further discussion of this point, 
see p. 351.) 
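
The effect of a restricted range can be seen in a simple simulation. The sketch below assumes a bivariate normal population with a correlation of .60 between predictor and criterion, and it treats as "selected" only those cases scoring more than half a standard deviation above the mean on the predictor. These figures, the helper names, and the seed are assumptions for illustration, not data from any study cited here.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(3)  # arbitrary seed
population_r = 0.60
cutoff = 0.5            # hypothetical admission cutoff, in standard-score units

pairs = []
for _ in range(10000):
    predictor = rng.gauss(0, 1)
    criterion = population_r * predictor + sqrt(1 - population_r ** 2) * rng.gauss(0, 1)
    pairs.append((predictor, criterion))

# Keep only the cases that would have been admitted on the predictor.
selected = [(p, c) for p, c in pairs if p > cutoff]

r_full = pearson_r([p for p, _ in pairs], [c for _, c in pairs])
r_selected = pearson_r([p for p, _ in selected], [c for _, c in selected])
print(f"whole group ({len(pairs)} cases):       r = {r_full:.2f}")
print(f"selected group ({len(selected)} cases): r = {r_selected:.2f}")
```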

Relation of Concurrent Validity to Predictive Validity. If a user needs a test 
badly, he may want to employ it for prediction even before the evidence 
on its predictive validity has been accumulated. Indeed, when a test such as 
the Strong Vocational Interest Blank is invented to predict whether an 
adolescent will enjoy a career as a physician, the criterion data cannot be 
obtained for some twenty years. This would sometimes mean a very long delay 
between the invention of a test and its practical use, if we insisted on waiting 
for the predictive coefficient before applying the test predictively. 

Concurrent validity can be determined at once, and may shed some light 
on the probable predictive validity of the test. Strong published his test in 
1928, offering as evidence of validity the fact that interest scores 
distinguished men of different occupations from each other. For instance, the 
Physician scores of doctors averaged much higher than those of nondoctors. 
The purpose of the test is not to find out whether a man is at present a 
doctor; it is to find out if a young man will, as he grows older, be satisfied with 
that career. If the direction of a man's interests at age 40 is the same as at 
20, then the concurrent validation based on older men does show that the 
test can safely be used to give vocational advice at age 20. Until long-term 
follow-up studies were made, users of the Strong test had to assume stability 
of interests. After publication of the test, Strong continued to accumulate 
evidence by following adolescents for twenty years or more and by 1954 was 
able to verify that the test indeed predicts vocational status over a long 
period. 

A concurrent-validation procedure may be employed for almost any 
predictive test, by administering it to persons whose criterion performance can 
be observed immediately. An aptitude test for medical students may be 
given at the time of graduation from medical school, grades being used as a 
criterion. A test intended to identify potential neurotics may be evaluated 
by determining whether it distinguishes present neurotic cases from some 
nonneurotic group such as medical patients coming to the same clinic. Kohs 
(1923, p. 182) offered a concurrent validation when he reported a correlation 
of .80 between Block Design IQs and Stanford-Binet IQs at the time he 
released his test. Almost never do we find research reports in which concurrent 
and predictive validities are determined under the same conditions, so that 
we cannot say just how much they are likely to differ. A reasonably close 
comparison may be made between the following correlations of educational 
proficiency tests with college grades in corresponding courses (Dressel and 
Schmid, 1951): 

                  Concurrent   Predictive
English              .61          .55
Social Studies       .79          .30
Science              .68          .49

The concurrent correlations were obtained at Michigan State University and 
the predictive correlations at Dartmouth, and it may be that the smaller 
range of ability at Dartmouth accounts for part of the decline in the 
validity coefficients. 

19. Which of the following describes concurrent validity, and which describes 
predictive validity? In which instances has concurrent validity been measured 
even though the test appears to be intended for predictive purposes? 

a. The Short Employment Tests are found to correlate .91 with the General 
Clerical Test, which has been used for some time as a predictor of job 
success. 

b. The manual for the Henmon-Nelson Test of Mental Ability reports 
correlations with school grades assigned one month later. 

c. A correlation is calculated to determine how well a certain test distinguishes 
patients diagnosed as schizophrenic from those diagnosed as brain-damaged. 

d. School records of delinquents and nondelinquents in high school are 
searched to learn what scores collected in the elementary grades correlate 
with delinquent status. 

20. A test of ability to understand spoken words is validated by administering it to 
first-graders at the end of the year, and correlating it with their present 
reading ability. The coefficient is fairly high. Would you expect a similarly high 
predictive-validity coefficient 

a. if the test is used at the start of the first grade to predict end-of-year 
reading? 

b. if the test is used at the end of the first grade to predict response of poor 
readers to a special remedial reading program? 

c. if the test is used with 4-year-olds, to predict later success in Grade 1 
reading? 

21. Would the attitude of present employees, taking a test in a 
concurrent-validation experiment, be the same as the attitude of applicants taking the 
test? 

Example of Validity Information. Let us now examine some of the data on 
validity given in the DAT manual, for that version of the TMC. The data 
offered include correlations with subsequent course grades, a four-year 
follow-up of school achievement, and a follow-up of post-high-school careers. 
Only a brief extract can be considered here. 

Table 11 gives some of the coefficients relating mechanical comprehension 
to course grades in science and shop. It is obvious that one cannot speak of 
"the validity" of a test for a certain field, save as a shorthand expression for 
a general trend. The variation of coefficients is great, even from group to 
group in the same school. There are many explanations for this: sampling 
fluctuations, differences in course content, differences in reliability of 
grading, differences in level of ability, etc. Ghiselli (1955, p. 112) reports similar 
radical fluctuations among validity coefficients for tests of mechanical 
comprehension against training criteria for repairmen in industry. He tabulated 
over 100 coefficients from various studies, finding a range from -.30 to +.60. 
Eleven studies reported validity coefficients above .50, and fourteen found 
coefficients below .20. This finding is not peculiar to the mechanical 
reasoning test. Whenever numerous coefficients are obtained for the same test and 
the same type of criterion, variation such as this is found. Only where 
conditions are highly standardized is a second validity coefficient likely to 
duplicate a first. Where training follows a uniform plan, where the level of ability 
is held constant by selection, and where the criterion is based on objectively 
measured performance, validity coefficients are as stable as the size of the 
sample allows. Such a coefficient, however, may not confidently be assumed 
to apply to another setting. 

TABLE 11. Validity of the TMC as a Predictor of Course Grades for Boys 

Course            Grade   Location             Time Between      Number     Validity
                                               Test and Marks    of Cases   Coefficient
Industrial Arts     9     Mt. Vernon, N.Y.     1 year               67         .39
                    9     Worcester, Mass.     4 months             89         .20
                    9     Worcester, Mass.     1 year               79         .05
Woodworking        10     Independence, Mo.    3 months             42         .30
General Science     9     Mt. Vernon, N.Y.     1 year               84         .19
                    9     Columbia, Mo.        8 months             88         .50
Physics            11     Schenectady, N.Y.    3½ years             42         .47
                   --     White Plains, N.Y.   1-2 years            41         .41

Source: Bennett et al., 1959, pp. M ff. 

In most predictive uses of tests, the published validity coefficient is no 
more than a hint as to whether the test is relevant to the tester's decision. He 
must validate the test in his own school or factory, and even then he can 
expect coefficients to fluctuate. For this reason, testers are usually forced back 
upon a psychological rather than a purely statistical use of scores. 

While a test publisher may be expected to include representative validity 
studies in the manual, much further evidence accumulates after the test is 
distributed. Only a thorough search of professional journals can locate all 
this information. The industrial psychologist can find many of the studies 
relevant to his selection problems in the "Validity Information Exchange" 
published quarterly in Personnel Psychology. One particular issue (Autumn, 
1954), for example, presents thirteen different studies, among which three 
use the Bennett test. We learn that among policemen in St. Louis, Du Bois 
and Watson obtained correlations for Form BB of about .28 with training 
grades and marksmanship, .20 with an achievement test, and .10 with rating 
on duty. (No other test was better able to predict the rating.) Bruce found 
no useful relation between Form AA and ratings of foremen in a tobacco 
plant, and McCarthy a correlation of -.10 with ratings of foremen in a 
factory which makes electrical equipment. Another useful source is the 
Dorcus-Jones Handbook (1950), which abstracts 426 studies on employee selection 
published prior to 1949. It is evident that the test manual can never 
exhaustively summarize or integrate such a varied literature. 
by a small amount. Differences between persons are much the same before 
and after the course. 

It is necessary not merely to identify an influence but to find out how 
strong it is. The writer once suspected that much of the TMC score 
depended on knowledge of a few specific principles (e.g., gears, levers), each 
of which was involved in several items. But when separate scores were 
obtained on each type of item, these scores correlated highly. Since a person 
high on gear problems was high on other items, it was unnecessary to 
introduce the concept of specific subtypes of mechanical comprehension. Such 
specific knowledge could account for only a small amount of the difference 
among persons. 

It is already evident that no single type of research is used in construct 
validation. We can give a brief tabulation of procedures merely to indicate 
the diversity of methods and describe the relevance of each method to the 
TMC. 

• Examination of items. This is sufficient to rule out some explanations; 
thus it is easily seen that neither arithmetic nor verbal reasoning affects 
scores. But it is also seen that the machines used are those common in 
Western culture, not in primitive Africa; this reminds us to consider cultural 
background in interpreting the test outside industrial nations. 

• Administration of test to individuals who "think aloud." This may show 
that in some items quite irrelevant features of the test (e.g., an obscure 
drawing) affect the score. It may show that some people succeed by an 
intuitive perception of answers which others reach by painstaking logic. 
This would suggest that the score means different things for different persons. 

• Correlation with practical criteria. Learning what courses or jobs the 
TMC predicts clarifies what types of mechanical work it applies to. 

• Correlation with other tests (and factor analysis). If the TMC 
correlates highly with a general intelligence test, it need not be interpreted in 
terms of a special mechanical aptitude. As a matter of fact, it does depend 
to a substantial degree, but by no means entirely, on general mental ability. 

• Internal correlations. The study of separate types of items described 
above is of this type. 

• Studies of group differences. The comparison of boys and girls is an 
example. 

• Studies of the effect of treatment on scores. Training in physics proved 
not to affect the TMC greatly. 

• Stability of scores on retest. If scores are unstable, one could not 
interpret mechanical comprehension as a lasting, vocationally significant 
aptitude. An obtained correlation of .69 between ninth- and twelfth-grade 
scores for boys promises a reasonable degree of stability, but also shows 
that this aptitude is far from a fixed quantity. 

The user of the test wants to know how the test can be interpreted, and 
how confidently. The manual should indicate what interpretation the author 
advises, and should summarize the available evidence from all types of 
studies relevant to this interpretation. If the user wishes to make some other 
interpretation, he must examine all the evidence on the test in the light of 
his own theory. 


27. Kohs (1923, pp. 168 ff.) wished to argue that the Block Design test measured 
"intelligence," defined as "ability to analyze and synthesize." He then offered 
the following types of evidence (plus others) for his claim. How does each of 
these bear on construct validity? (The Stanford-Binet test was at that time 
recognized as the best available measure of intelligence but was thought 
possibly to depend too heavily on verbal ability and school training.) 

a. Logical analysis of the "mental processes" required by the items. 

b. Increase in average score with each year of age. 

c. Correlations as follows: 

   Binet score with age          .80 
   BD score with age             .66 
   BD score with Binet score     .81 

d. Correlations: 

   Binet score with teachers' estimates of intelligence    .47 
   BD score with teachers' estimates of intelligence       .23 

e. Correlations: 

   Binet score with vocabulary   .91 
   BD score with vocabulary      .77 

f. Correlations between successive trials: on Binet, .91; on BD, .84. 

28. Which of the variables in Kohs' study are acceptable as criteria of pure 
intelligence? 


Suggested Readings 


French, John W. Validation of new item types against four-year academic criteria. 
J. educ. Psychol., 1958, 49, 37-76. 

This predictive study compares different types of tests for college applicants 
in terms of their power to predict grades and successful completion of 
college work. The study is unusual because of the large number of measures 
used, the large sample in each college, and the repetition of the experiment 
in many colleges. Note particularly the degree to which results differ for 
different criteria and different colleges. 

Peak, Helen. Problems of objective observation. In Leon Festinger & Daniel Katz 
(eds.), Research methods in the behavioral sciences. New York: Dryden, 1953. 
Pp. 243-299. 

This chapter, directed toward the social scientist choosing a measurement 
procedure for a research project, discusses the qualities which make a 
procedure satisfactory. Dr. Peak outlines many methods used in establishing 
construct validity. 

Thorndike, Robert L. The estimation of test validity criteiia of proficiency. Per¬ 
sonnel selection New York: Wiley, 1949, Pp. 119-159. 

Thorndike desciibes the vanous types of measures that may be used as en- 
tena, particulaily in industrial applications of tests. 

Validity. Technical recommendations for psychological tests and diagnostic 
techniques. Washington: American Psychological Association, 1954. Pp. 13-28. 
(Psychol. Bull., 1954, 51, Supplement.) 

This section of the recommendations introduces the four types of validity, 
recommends types of information about validity to be included in test 
manuals, and gives specific examples of good and bad practice. It is suggested 
that the reader compare the manual of some recent test with these 
recommendations. 


1. Begin with the pairs of raw scores to be studied. 

      X   Y      X   Y      X   Y      X   Y      X   Y 
     24  35     27  38     26  39     29  35     30  42 
     25  39     28  37     30  39     24  38     28  37 
     24  39     29  36     32  40     17  24     30  39 
     25  36     19  34     30  42     29  38     26  37 
     31  43     28  37     25  38     29  38     26  39 
     22  38     27  32     32  43     27  36     23  37 
     30  43     25  38     26  37     30  39     20  29 
     24  35     30  41     24  36     26  40     25  38 
     25  40     31  41     21  32     25  33     15  31 

2. Tabulate the points in a scatter diagram, entering one tally for each pair 
of scores. (The first pair [24-35] is tabulated in the cell above 24 on the 
X scale, and opposite 35 on the Y scale. This cell is outlined in the 
illustration.) 

3. Count the number of tallies in each column, and write it below the diagram 
in a row labeled $f_x$. Count the number in each row, and write it beside the 
diagram in a column labeled $f_y$. 

4. Select an arbitrary origin for X and for Y, and determine the mean and 
standard deviation for each as in Computing Guide 2 (computation not shown). 

     $A_x$ = 26.0      $A_y$ = 
     $c_x$ = .20       $c_y$ = 
     $M_x$ = 26.2      $M_y$ = 
     $s_x$ = 3.83      $s_y$ = 3.72 

5. In each cell of the scatter diagram, multiply the number of tallies by the 
value of $d_x$ written below that column, and write the product in the cell. 
(In the outlined cell, for instance, there are two tallies, and $d_x$ is -2; 
the product is -4.) 

In each row, add the numbers written in the cells, and place in a column 
labeled $f_y d_x$. 

Multiply each entry in this column by $d_y$ and enter in a column labeled 
$f d_x d_y$. 

Add the column $f d_x d_y$. 

Substitute the numbers in the following formula: 

$$r_{xy} = \frac{\dfrac{\sum f d_x d_y}{N} - c_x c_y}{s_x s_y}$$ 

$$r_{xy} = \frac{\dfrac{473}{45} - .06}{(3.83)(3.72)} = \frac{10.51 - .06}{14.24} = \frac{10.45}{14.24}$$ 

$$r_{xy} = .73$$ 


COMPUTING GUIDE 5. COMPUTING THE PRODUCT-MOMENT CORRELATION COEFFICIENT 
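
The hand method of Computing Guide 5 can be checked by machine. The sketch below is only an illustration, not part of the guide: it computes the product-moment correlation directly from the 45 raw pairs listed in step 1, and also evaluates the guide's own formula with the totals printed there. The function names are our own. On the pairs as transcribed here the direct computation comes out a little higher than the guide's .73; the gap appears to come from rounding and from the divisor used for the standard deviations, which is our reading rather than something the guide states.

```python
from math import sqrt

# The 45 (X, Y) pairs from step 1 of the guide, read row by row.
PAIRS = [
    (24, 35), (27, 38), (26, 39), (29, 35), (30, 42),
    (25, 39), (28, 37), (30, 39), (24, 38), (28, 37),
    (24, 39), (29, 36), (32, 40), (17, 24), (30, 39),
    (25, 36), (19, 34), (30, 42), (29, 38), (26, 37),
    (31, 43), (28, 37), (25, 38), (29, 38), (26, 39),
    (22, 38), (27, 32), (32, 43), (27, 36), (23, 37),
    (30, 43), (25, 38), (26, 37), (30, 39), (20, 29),
    (24, 35), (30, 41), (24, 36), (26, 40), (25, 38),
    (25, 40), (31, 41), (21, 32), (25, 33), (15, 31),
]


def pearson_r(pairs):
    """Product-moment correlation computed directly from raw score pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def guide_r(sum_fdxdy, n, cx_cy, s_x, s_y):
    """The guide's formula: (sum(f*dx*dy)/N - cx*cy) / (sx*sy)."""
    return (sum_fdxdy / n - cx_cy) / (s_x * s_y)


mean_x = sum(x for x, _ in PAIRS) / len(PAIRS)   # the guide reports 26.2
r_raw = pearson_r(PAIRS)                         # direct computation
r_guide = guide_r(473, 45, 0.06, 3.83, 3.72)     # totals printed in the guide

print(f"mean of X = {mean_x:.1f}")
print(f"r from raw pairs     = {r_raw:.2f}")
print(f"r from guide formula = {r_guide:.2f}")
```

The scatter-diagram method was designed for desk calculation; with a machine, the direct formula is simpler and avoids the intermediate rounding.
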
How to Choose Tests 


THE first important consideration in choosing a test is its validity. As we 
have seen in the preceding chapter, validity information permits us to judge 
whether the test measures [...]. Validity is examined by comparing scores to 
an external criterion, by comparing items to a specified body of content, or 
by establishing an explanation of scores in terms of general constructs. 
There are many additional qualities to consider in choosing a test, some 
related to its statistical properties and some to its practical features. 

RELIABILITY 

Reliability studies give information about the consistency of a person's 
scores on a series of measurements. For example, Bennett reports that on 
the TMC for ninth-graders "the standard error of a score is 3.7." If a boy 
were tested many times on a series of equivalent mechanical comprehension 
tests, his scores would vary; the standard error is a calculated estimate 
of the amount of this variation. It says that the series of raw scores for 
this one boy would have a standard deviation of about 3.7. Since the standard 
deviation of scores of different persons is 10.4 points, a standard error of 
3.7 allows the boy's position within the group to shift over an appreciable 
range, as Figure 21 shows. When we test the boy only once, he earns just one 
of his many possible scores. We do not know in which part of his range we 
happened to catch him. 

Since scores vary from one trial to another, no one measure can be trusted 
absolutely. The obtained score indicates only roughly the level of the 
person's ability or typical behavior. The smaller the standard error, the 
more precisely his level can be judged. Reliability information tells how 
much confidence we can place in a measurement. 

Reliability always refers to consistency throughout a series of 
measurements. There are various ways to observe such a series: for example, 
by using the same test repeatedly or by using a series of "parallel" forms. 
Different experimental procedures measure the effect of different types of 
variation, and therefore different reliability coefficients mean different 
things. 

Test scores vary over time. Attention and effort change from moment to 
moment. Over longer periods, further shifts in score are created by physical 
growth, learning, changes in health, and personality change. If we employ 
different test items for each measurement, another type of variation is 
introduced. The person who is lucky on one trial, finding items that are easy 
for him, will encounter unfamiliar items on some other trial and earn a lower 
score. To these variations must be added the unaccountable "chance" effects. 
Chance effects enter even when we use the same procedure twice in rapid 
succession. The [...] instantaneous lapses [...] variation in test scores. 

A judgment that a student has completed a course or that a patient is 
ready for release from therapy must not be seriously influenced by chance 
errors, temporary variations in performance, or the tester's choice of 
questions. An erroneous favorable decision may be irreversible. An erroneous 
unfavorable decision, though reversible, is unjust, disrupts the person's 
morale, and retards his development. Unless the tester and subject recognize 
how fallible a measure is, they are likely to rely on it more than is 
justified. 

[...] In most experimental studies the question is whether an observed 
difference results from the experimental treatment or could be accounted for 
by chance variation. The larger the chance variation in the test employed, 
the harder it is to find a significant difference between groups. Large error 
variance masks scientifically important variations created by the 
experimental conditions. Making a test more reliable improves the efficiency 
of an experiment in the same way that increasing the number of subjects does. 
TABLE 12. Possible Sources of Variance in a Test Score 

I. Lasting and general characteristics of the individual 
   1. [...] 
   2. [...] instructions, testwiseness, techniques of taking tests 
   3. Ability to solve problems of the general type presented in this test 
   4. Attitudes, emotional reactions, or habits generally operating in 
      situations like the test situation (e.g., self-confidence) 

II. Lasting and specific characteristics of the individual 
   1. Knowledge and skill required by particular problems in the test 
   2. Attitudes, emotional reactions, or habits related to particular test 
      stimuli (e.g., fear of high places brought to mind by an inquiry about 
      such fears on a personality test) 

III. Temporary and general characteristics of the individual (systematically 
     affecting performance on various tests at a particular time) 
   1. Health, fatigue, and emotional strain 
   2. Motivation, rapport with examiner 
   3. Effects of heat, light, ventilation, etc. 
   4. Level of practice on skills required by tests of this type 
   5. Present attitudes, emotional reactions, or strength of habits (insofar 
      as these are departures from the person's average or lasting 
      characteristics, e.g., political attitudes during an election campaign) 

IV. Temporary and specific characteristics of the individual 
   1. Changes in fatigue or motivation developed by this particular test 
      (e.g., discouragement resulting from failure on a particular item) 
   2. Fluctuations in attention, coordination, or standards of judgment 
   3. Fluctuations in memory for particular facts 
   4. Level of practice on skills or knowledge required by this particular 
      test (e.g., effects of special coaching) 
   5. Temporary emotional states, strength of habits, etc., related to 
      particular test stimuli (e.g., a question calls to mind a recent bad 
      dream) 
   6. Luck in the selection of answers by "guessing" 

Source: After R. L. Thorndike, 1949, p. 73. 

In a test intended for predicting a definite criterion, reliability is less 
important than predictive validity. If predictive validity is satisfactory, 
low reliability does not discourage us from using the test. In comparing two 
tests which measure the same thing, however, the more accurate test will 
have the higher validity coefficient. 

1. Locate each of the following sources of variance in Table 12: 

   a. During a speeded test a student breaks his pencil and loses time while 
      obtaining another. 
   b. An industrial worker who has been in this country for a short time 
      misunderstands an important phrase in the instructions for a 
      performance test. 
   c. A "hillbilly" is unable to answer correctly a question from an 
      intelligence test about the purchase of a railroad ticket. 
   d. A suspicious patient refuses to cooperate and gives perfunctory answers. 
   e. A student guesses at every item of which he is uncertain. 

2. Give an example of each source of variance in Table 12 which might affect 
performance on the Block Design test. 

3. Which types of variation in Table 12 would lower the correlation between 
test and criterion in each of the following situations? 

   a. A test of high-jumping ability is used to select finalists in a track 
      meet. The criterion is performance in the meet, two weeks after the 
      trials. 
   b. A pencil-and-paper test of mechanical ability is used to predict 
      performance of mechanical trainees. Piecework earnings after training 
      are used as the criterion. 


Interpretation of Coefficients 

Reliability is usually expressed in terms of a "reliability coefficient," 
i.e., the correlation between two measurements obtained in the same manner, 
or in terms of the standard error of measurement which we have already 
described. Before considering the procedures used to estimate reliability, 
we can discuss general principles applying to all such coefficients. The 
principles are as follows. 

• A reliability coefficient tells what proportion of the test variance is 
nonerror variance. 

• The reliability coefficient depends on the length of the test. 

• The reliability coefficient depends on the spread of scores in the group 
studied. 

• A test may measure reliably at one level of ability and unreliably at 
another level. 

• The validity coefficient cannot exceed the square root of the reliability 
coefficient. 

Reliability and Error of Measurement. Variation between persons is 
described by the standard deviation $s$, or by the score variance $s^2$. This 
variation represents a combination of the differences that we wish to measure 
(e.g., true ability in spelling) and the variation associated with a 
particular measurement (e.g., words used in the test, fatigue of some persons 
on the day of testing). The true ability of any person would remain constant 
from one measure to another, but obtained scores would vary to some extent. 

This conception of a "true score" assumes that we would really like to 
determine the person's score on a very large sample of behavior. For 
employment purposes, we would like to know what proportion of words the 
stenographer will spell correctly during the next several years. We test 
performance on only one day and on only one set of words; this is a small 
sample of the total performance which we wish to estimate. In a school 
spelling test, the teacher may want to estimate performance on a particular 
assigned list of words on a particular day. But this estimate should ideally 
cover the pupil's "true" knowledge on that day, as observed on many, many 
trials. Any one trial on a particular word is a small sample of his 
performance on that word. The true score is the average score the person 
would obtain if the performance were observed by a very long series of 
samples or trials (assuming no practice effect from the testing). Error is 
defined as the variation or fluctuation of the person's scores within the 
series. It is a sampling error arising because any one trial exposes only a 
portion of the behavior that interests us. The size of these sampling errors 
is described by the standard error of measurement ($s_e$), or by the "error 
variance" ($s_e^2$). The obtained score is a combination of the true score 
and the error on a particular trial. The variance of obtained scores is the 
total of the error variance and the variance of true scores. 

These variances have a direct relation to the correlation between scores 
from two samples of behavior. If we let $r_{tt}$ stand for the reliability 
coefficient (correlation between scores), and $s_\infty^2$ for the variance of 
true scores, 

$$r_{tt} = \frac{s_\infty^2}{s_t^2} = \frac{s_t^2 - s_e^2}{s_t^2} = \frac{\text{True variance}}{\text{Total variance}}$$ 

From the data for the TMC, we find 

     Total    $s_t$ = 10.4    $s_t^2$ = 108.2 
     Error    $s_e$ = 3.7     $s_e^2$ = 13.7 
     True                     $s_\infty^2$ = 94.5 (by subtraction) 

$$r_{tt} = \frac{94.5}{108.2} = .87$$ 
The reliability coefficient tells what proportion of the test variance is 
nonerror variance. In this example, 87 percent of the variance is nonerror 
and 13 percent is "error"; just what we mean by "error" is defined in part by 
the experimental procedure, as we shall see later. 
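
The arithmetic for the TMC can be retraced in a few lines. The following is a minimal sketch under the definitions given above (total variance equals true variance plus error variance); the function names are ours, chosen for illustration.

```python
from math import sqrt


def reliability_from_errors(sd_total, se_measurement):
    """Reliability = true variance / total variance,
    where true variance = total variance - error variance."""
    total_var = sd_total ** 2
    error_var = se_measurement ** 2
    true_var = total_var - error_var
    return true_var / total_var


def standard_error(sd_total, reliability):
    """Inverse relation: s_e = s_t * sqrt(1 - r_tt)."""
    return sd_total * sqrt(1.0 - reliability)


# TMC figures quoted in the text: s_t = 10.4, s_e = 3.7.
r_tmc = reliability_from_errors(10.4, 3.7)
se_back = standard_error(10.4, 0.87)

print(f"r_tt = {r_tmc:.2f}")       # about .87
print(f"s_e  = {se_back:.2f}")     # close to the reported 3.7
```

Going back from the rounded coefficient .87 gives a standard error slightly above 3.7, which is only the effect of rounding.
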

Reliability and Test Length. The importance of lengthening tests is that with 
every question added, the sample of performance becomes a more adequate 
index of performance on all possible questions. A single addition problem is 
a very poor sample of a person's ability, since we are quite likely to 
present a number combination that is particularly hard or easy for him. By 
asking more and more questions of the same general sort, we come closer to a 
good estimate of his general ability on addition problems. 

Longer tests are also less influenced by other chance factors. If a test has 
only five multiple-choice items, a few people might get all the items correct 
just by guessing. In a fifty-item test, practically no one could do well by 
guessing. Variations due to guessing tend to cancel out. Three fifteen-minute 
observations of a child's social behavior provide a poor sample of his 
typical behavior; thirty observations, however, should give a dependable 
picture. 
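
The point about guessing can be made concrete with the binomial distribution. The sketch below rests on assumptions the text does not state: each item is taken to have four choices, guessing is blind, and "doing well" on the fifty-item test is arbitrarily set at 40 or more correct. The function name is ours.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


P_GUESS = 0.25  # assumed: four-choice items, blind guessing

p_five_perfect = prob_at_least(5, 5, P_GUESS)     # all five items right
p_fifty_high = prob_at_least(40, 50, P_GUESS)     # 40 or more of fifty right

print(f"all 5 of 5 by guessing : {p_five_perfect:.6f}")
print(f"40+ of 50 by guessing  : {p_fifty_high:.2e}")
```

Under these assumptions about one blind guesser in a thousand earns a perfect score on the five-item test, while a high score on the fifty-item test by guessing alone is practically impossible.
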

The Spearman-Brown formula (see Computing Guide 6) permits us to 
estimate what reliability the test would have if it were lengthened or 
shortened. The formula assumes that when we change the length of the test we 
do not change its nature. Extreme increases in test length, however, 
introduce boredom and may reduce reliability. Furthermore, unless one is 
careful, added items or added periods of observation may not cover the same 
behavior or ability as the original test. 

One must examine the reliability of every score he intends to interpret. 
Some testers, knowing that a test as a whole is reliable, place faith in its 
part scores also. Since short tests are likely to be unreliable, a part score 
based on a few items is of limited value. The reliability of part scores as 
well as of total scores should be given in the test manual. If this 
reliability is low or unknown, the part scores cannot be relied upon. 

While inaccuracy lowers validity, this does not necessarily argue for 
making predictor tests very long. An increase in test length has a great 
effect on reliability but a much smaller effect on validity. The following 
formula applies, where $t_n$ represents a test $n$ times as long as test $t$ 
(Gulliksen, 1950b, pp. 88ff.). 

$$r_{t_n c} = r_{tc}\sqrt{\frac{r_{t_n t_n}}{r_{tt}}}$$ 

The observed test-criterion correlation is $r_{tc}$. Under the square root 
sign, $r_{tt}$ is the observed reliability for test $t$ and $r_{t_n t_n}$ is 
the reliability of the longer test, calculated by the Spearman-Brown formula. 
Figure 22, derived from the formula above, shows the effect of lengthening or 
shortening the TMC, using $r_{tc}$ as .40 and $r_{tt}$ as .87. As we lengthen 
the test, its reliability approaches 1.00 according to the Spearman-Brown 
formula (broken line). The increase in validity is expected to follow the 
solid line in the figure. As the test is made longer and longer, validity 
approaches .43 as a limit. Validity 


1. Suppose that a test has a known reliability. The Spearman-Brown formula, 
given below, estimates the reliability of the score from a similar test $n$ 
times as long. 

$$r_n = \frac{nr}{1 + (n - 1)r}$$ 

where $r$ is the original reliability and $r_n$ is the reliability of the 
test $n$ times as long. 

2. To predict the reliability of a test twice as long as the original test, 
substitute in the formula $n = 2$. If $r = .40$, 

$$r_2 = \frac{2(.40)}{1 + (1)(.40)} = \frac{.80}{1.40} = .57$$ 

3. Suppose the original test is to be reduced to only half its original 
length. The reliability of the short test is estimated using $n = \frac{1}{2}$. 

$$r_{1/2} = \frac{\frac{1}{2}(.40)}{1 + (\frac{1}{2} - 1)(.40)} = \frac{.20}{1 - \frac{1}{2}(.40)} = \frac{.20}{.80} = .25$$ 


COMPUTING GUIDE 6. USE OF THE SPEARMAN-BROWN FORMULA 
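
The Spearman-Brown computation of Guide 6 is easily expressed as a function. The sketch below reproduces the guide's two worked cases; the second function, which solves the same formula for the length factor needed to reach a target reliability, is our own addition for illustration and is not part of the guide.

```python
def spearman_brown(r, n):
    """Estimated reliability of a test n times as long as one with reliability r."""
    return n * r / (1 + (n - 1) * r)


def length_factor_needed(r, r_target):
    """Length factor n at which the Spearman-Brown estimate equals r_target.
    Obtained by solving r_target = n*r / (1 + (n - 1)*r) for n."""
    return r_target * (1 - r) / (r * (1 - r_target))


r_double = spearman_brown(0.40, 2)       # guide step 2
r_half = spearman_brown(0.40, 0.5)       # guide step 3
n_for_80 = length_factor_needed(0.40, 0.80)

print(f"doubled test : {r_double:.2f}")
print(f"halved test  : {r_half:.2f}")
print(f"length factor to reach .80 from .40: {n_for_80:.1f}")
```

Like the formula itself, both functions assume that the added or removed material is of the same nature as the original test.
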
FIG. 22. Reliability and validity of the TMC as its length is changed. 

creeps up so slowly that lengthening a test beyond a certain point is 
unprofitable. Beyond that point it is better to use the extra time to measure 
some other aspect of behavior. 
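
The slow rise of validity with test length can be tabulated. The sketch below assumes the relation described in the text, namely that the validity of the lengthened test equals the observed validity times the square root of the ratio of the lengthened test's reliability to the original reliability, with the former taken from the Spearman-Brown formula. It uses the TMC figures quoted above (.40 and .87); the function names are ours.

```python
from math import sqrt


def spearman_brown(r, n):
    """Estimated reliability of a test n times as long."""
    return n * r / (1 + (n - 1) * r)


def validity_at_length(r_tc, r_tt, n):
    """Expected test-criterion correlation for a test n times as long."""
    r_nn = spearman_brown(r_tt, n)
    return r_tc * sqrt(r_nn / r_tt)


R_TC, R_TT = 0.40, 0.87   # TMC values used for Figure 22

for n in (0.5, 1, 2, 4, 8):
    print(f"n = {n:>3}: reliability {spearman_brown(R_TT, n):.3f}, "
          f"validity {validity_at_length(R_TC, R_TT, n):.3f}")

# Limit as the test becomes indefinitely long (reliability -> 1.00).
validity_limit = R_TC / sqrt(R_TT)
print(f"limit: {validity_limit:.2f}")
```

Doubling the test raises the expected validity only from .40 to about .41, which is why the extra testing time is usually better spent on another measure.
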

4. A spelling test of thirty words has a reliability coefficient of .80. What 
reliability would be expected if ninety words were used? 

5. Which kinds of variation (Table 12) are reduced in influence when a test 
is lengthened? 

Relation of Reliability to Validity. An inaccurate test cannot be a good 
predictor. There is a rule which states how reliability limits validity: The 
correlation between the test and an independent criterion can never be 
higher than the square root of the correlation between two forms of the 
test. For example, if reliability is .64, validity cannot exceed .80. This 
conclusion is derived from the formula relating test length to validity. 
Suppose the test is so closely related to the criterion that the two would be 
perfectly correlated if the test were free from error of measurement. That 
is, suppose that when $n$ becomes extremely large, $r_{t_n c} = 1.00$ and 
$r_{t_n t_n} = 1.00$. Then, according to the formula given above, 
$r_{tc} = 1\sqrt{r_{tt}}$. 

Why is it that a test can correlate higher with a different measure than it 
does with its own twin? To understand this, consider two short spelling 
tests, and a "criterion" based on exhaustive measurement over several weeks, 
with thousands of words. Each test score is much influenced by random error 
of sampling and guessing, but the criterion is not. The errors in the two 
tests lower the correlation between them. Just one such set of errors affects 
a test-criterion correlation, which therefore is higher than the test-test 
correlation which both sets of errors affect. 
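
A small simulation may make the argument easier to accept. The sketch below is an illustration under invented assumptions, not data from any study: each simulated person has a true score, two short forms each add their own independent random error of the same size as the true-score spread, and the "criterion" is the true score itself, free of error. All names and settings are ours.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n_people=20000, true_sd=1.0, error_sd=1.0, seed=0):
    """Return (test-test correlation, test-criterion correlation)."""
    rng = random.Random(seed)
    true = [rng.gauss(0, true_sd) for _ in range(n_people)]
    form_a = [t + rng.gauss(0, error_sd) for t in true]  # one set of errors
    form_b = [t + rng.gauss(0, error_sd) for t in true]  # a second, independent set
    criterion = true                                     # exhaustive, error-free measure
    return pearson_r(form_a, form_b), pearson_r(form_a, criterion)


r_tt, r_tc = simulate()
print(f"test with its twin   : {r_tt:.2f}")
print(f"test with criterion  : {r_tc:.2f}")
print(f"square root of r_tt  : {sqrt(r_tt):.2f}")
```

With these settings the two forms correlate about .50 with each other, while each correlates about .71 with the criterion, close to the square root of the reliability, which is the ceiling stated in the rule.
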
Reliability and Range of Scores. Errors of measurement are most troublesome 
when the differences in ability or personality among the subjects are 
small. When we are hiring one person out of a group of applicants, the final 
decision often hinges on a difference of a few points between the best and 
next-best man. Slight errors of measurement in such a case might result in 
hiring the poorer man. If one is screening applicants for factory work 
merely to rule out incompetents, however, a less-refined test will be 
satisfactory. Errors of a few points cannot conceal the gross deficiencies in 
ability which distinguish the "hopeless" from the average run of workers. 

Assuming that error of measurement remains constant, we see from the 
formula given above that $r_{tt}$ decreases when the variance of true scores 
decreases. 

A test which has satisfactory reliability for use with a wide-range group 
may be unsatisfactory in a highly selected group. A rather crude mental test 
can be used to identify which pupils entering school have mental handicaps; 
but to divide the handicapped group, determining who is to be placed in a 
special class, requires a much more accurate test. A criterion rating by a 
supervisor may be adequately reliable for distinguishing failures from 
acceptable men but is not so good for telling which men within the 
satisfactory group are truly best. 
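
The effect of range on reliability follows directly from the variance relation, if one accepts the assumption stated above that the error of measurement stays the same from group to group. The sketch below applies it to the TMC figures; the narrow-group standard deviation of 6 is a made-up value chosen only for illustration, and the function name is ours.

```python
def reliability_in_group(r_wide, sd_wide, sd_narrow):
    """Reliability expected in a group with standard deviation sd_narrow,
    assuming the error variance found in the wide-range group is unchanged."""
    error_var = sd_wide ** 2 * (1 - r_wide)   # s_e^2 from the wide-range group
    return 1 - error_var / sd_narrow ** 2


# TMC: r_tt = .87 where s = 10.4; hypothetical selected group with s = 6.
r_selected = reliability_in_group(0.87, 10.4, 6.0)
print(f"reliability in the selected group: {r_selected:.2f}")
```

A test that is quite dependable among ninth-graders generally would thus be much less dependable for ranking members of a homogeneous group.
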

6. The reliability of a test is .95 in a group for which s is 20. What will 
the reliability be in a group where s is 10? (Compute $s_e$ for the 
wide-range group, and use this value to compute the reliability for the 
second group.) 

Reliability at Different Score Levels. When a single reliability coefficient 
is reported, we tend to assume that a test has the same accuracy for all 
types of people. This assumption is often incorrect. Many tests are reliable 
only at certain levels of performance. The Gates Reading Survey for Grades 
3-10, for example, gives reliable estimates of reading skill for pupils in 
most grades. When third-graders take the test, they find it so difficult that 
they do a great deal of guessing. As a result, individual differences within 
the third grade are unreliably reported. Tests, no matter how reliable, are 
inaccurate for pupils whose scores are near the chance level. Easy tests give 
inaccurate measures of individual differences in the extremely high ranges 
of talent. 

Figure 23 shows the scores of Navy recruits who took a pitch-discrimination 
test twice. If the test were accurate, the two scores for each man would 
be nearly the same and all points would fall along the diagonal line. The 
test consists of 100 pairs of tones; in each pair, the man reported whether 
the second tone was higher or lower than the first. A score of 50 would be 
obtained by pure chance. According to the scatter diagram, high scores are 
fairly reliable. Men scoring 85 on the first test fell between 72 and 95 on 
retest. But men scoring near the chance level (e.g., 55) scattered widely on 
the retest (40 to 87). It is evident that the standard error of low scores is 
great. 

FIG. 23. Test and retest scores on pitch discrimination (Ford et al., 1944). 
Horizontal axis: score on first test. 

The broken line shows the average score, on the second test, of men in 
each column. The upcurve of this line at the left is especially interesting. 
Many men with very low scores in the first test did well on the retest. A 
score of 25 is too far below 50 to be a chance score. Probably men having 
such low scores on the first test misunderstood directions and judged the 
first tone instead of the second. Following directions correctly on the 
retest would shift their scores from seventy items wrong to seventy items 
right. 
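
The claim that a score of 25 is too far below 50 to be a chance score can be checked with the binomial distribution. The sketch below assumes that a man responding at random has an even chance on each of the 100 pairs and that the pairs are independent; the function name is ours.

```python
from math import comb


def prob_at_most(k, n, p=0.5):
    """Probability of k or fewer correct answers in n independent items."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


p_25_or_less = prob_at_most(25, 100)
print(f"P(25 or fewer right out of 100 by chance) = {p_25_or_less:.1e}")
```

Under these assumptions the probability is well under one in a million, so some systematic cause, such as the misunderstanding of directions suggested above, is far more plausible than bad luck.
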
A test should be appropriate in difficulty for the decision to be made. 
Figure 24 shows distributions of scores on several tests given the same 
group. The very easy test A may be quite satisfactory for measuring 
differences at the lower end of the group. Test A is probably unreliable for 
the group as a whole, since variation of only a few points causes a subject 
to drop from the top of the group to the average. Furthermore, it does not 
distinguish between the persons tying at 100, even though they probably are 
not equally able. Test B is difficult. The distribution is skewed in the 
opposite direction from that for test A; high scores spread out, but 
differences at the low end of the scale are too small to distinguish 
individuals reliably. A normal distribution, such as that for test C, spreads 
out cases at both ends of the scale. With a symmetrical distribution, scores 
at the two ends are likely to be equally reliable. For this reason, tests 
yielding roughly normal distributions are preferred where it is necessary to 
distinguish equally well all along the scale. If the decision requires us 
only to distinguish the best men, test B is efficient. If we need only to 
eliminate the poorest men, we could use A. 

FIG. 24. Distributions of scores for several tests given to the same group. 

7. Which distribution in Figure 24 would be most desirable in each of the 
following cases? 

   a. A psychologist wishes to measure liberalness of attitudes, to study its 
      relation to voting habits. 
   b. A college wishes to pick out freshmen needing special training in 
      reading. 
   c. A test for college guidance measures interest in medicine. 
   d. An employer wishes to select the best statistician from a group of 
      applicants. 

8. The California Test of Personality, Elementary, contains several subtests, 
one of which is Feeling of Belonging. A low score on this questionnaire is 
said to indicate maladjustment. According to the test manual, the percentile 
rank corresponding to each possible score is as follows: 

     Score   1  2  3  4  5  6   7   8   9  10  11  12 
     PR      1  1  1  1  1  5  10  15  25  40  65  90 

   a. How would a boy's standing in the group change if his score changed two 
      points? 
   b. What is the shape of the raw-score distribution? What does this 
      distribution imply regarding the usefulness of the test? 




9. In World Series baseball, some pinch hitters reach batting averages as 
high as .750, whereas the best regular players rarely exceed .400 for seven 
games. How can this be explained? What principle regarding reliability does 
it illustrate? 


Types of Coefficient 

A person's scores vary from time to time and from form to form of the test. 
Some of these variations are regarded as a weakness in the measuring 
procedure, i.e., as "error." But the meaning of "error" depends on the 
purpose of testing. If a score is supposed to indicate a person's temporary 
condition at the time of testing, it is desirable for scores to vary from 
moment to moment. If the score is supposed to represent a lasting quality, 
moment-to-moment variation is undesirable. Consider, for example, a test 
purporting to measure rigidity of thinking. This might be used to predict 
performance as a scientist, to measure the level of adjustment of a patient 
during and after therapy, or to measure the effect of a particular stress 
applied during an experiment. Should variation of a person's score over time 
be regarded as true variance or as error variance? Stability would be a great 
advantage in predicting scientific success, where we need to measure a 
lasting characteristic. Since we want an estimate of behavior over a long 
period, instability from occasion to occasion is error, for this use of the 
test. Likewise, we need a stable measure of rigidity to judge a patient's 
status after therapy; there is no point in knowing that he functions well 
today if he is likely to think rigidly next week. On the other hand, if the 
therapist wants a week-to-week barometer of the patient's temporary state, 
stability would be a disadvantage. Likewise, to measure outcomes in a stress 
experiment, the test must be sensitive to momentary states of the individual. 
Too stable an instrument would be of no value for these two purposes. 

For a comprehensive understanding of the test, we would like to know 
what proportion of the variance can be ascribed to each of the four 
categories of Table 12. We obtain such estimates by making two or more 
measures of each person and then correlating the scores or performing a 
variance analysis. Different experiments have to be made to measure each type 
of variation. Figures 25 and 26 help to explain the various reliability 
coefficients. Each experiment treats some types of variation as "error"; in 
the diagrams these are left unshaded, while the nonerror variance is shaded. 

The first procedure to be considered is the retest condition, obtained by 
administering the same test on two occasions. This is called a coefficient of 
stability, because it tells us how stable this particular performance is. 
General-lasting characteristics (e.g., on the TMC, general understanding of 
levers) enter both measures. A person high in this ability tends to be high 
on both trials. Specific-lasting characteristics also affect both measures 
similarly; knowledge of a particular fact about air pressure on an airplane 
rudder will help one on that particular item of both test and retest. This 
characteristic contributes to variation between persons, not to variation 
between trials for the same person. Temporary factors (e.g., health, or 
casual variation in memory) may help an individual on one occasion and lower 
his score on the other. They therefore lower the test-retest consistency and 
are counted as error in this type of reliability (Figure 25). 

FIG. 25. The coefficient of stability. Shading indicates the portion 
counted as true variance. (The diagram divides the variance into General and 
Specific, Temporary and Lasting.) 

The coefficient of equivalence tells how well the test score agrees with 
other equivalent measures made at the same time. It is obtained by giving 
two forms (e.g., Form A and Form B of the TMC) in close succession. The 
two forms should be closely comparable, measuring the same general 
attributes at the same approximate level of difficulty. As Figure 26 shows, 
general attributes affect both tests the same way. But since the tests 
include different items, a specific attribute like knowledge about the 
airplane rudder helps on one form but not the other. It therefore lowers the 
correlation between forms. "Internal-consistency" procedures, which also lead 
to a coefficient of equivalence, are discussed below. 

FIG. 26. The coefficient of equivalence. Shading indicates the portion 
counted as true variance. (One panel is titled Internal-Consistency 
Procedure.) 

A third procedure, less commonly used, introduces an appreciable delay 
between test and equivalent test. Now, both changes in the person and 
substitution of new items lower the correlation and are included in the error 
variance. This coefficient reflects both the stability and the equivalence of 
the measures. 

If all three coefficients are obtained, we can determine how much of the 
test variance is due to each type of variation, but such information is not 
often available. For the Mechanical Reasoning Test of the DAT battery, 
however, these correlations are reported (in various sources): 

     Form A with Form B, immediate               .85 
     Retest after three years, Form A            .73 
     Form A with Form B, three-year interval     .65 

These facts permit us to construct Figure 27 by subtracting the various 
estimates from 1.00 and from each other. 

Total variance                                      100% 

Delayed parallel-test correlation .65; hence 
  "True" (lasting-general variance, LG)              65% 
  "Error" (TS + TG + LS)                             35% 

Immediate parallel-test correlation .85; hence 
  "True" (LG + TG)                                   85% 
  Subtracting, temporary-general variance (TG)       20% 
  "Error" (TS + LS)                                  15% 

Delayed retest correlation .73; hence 
  "True" (LG + LS)                                   73% 
  Subtracting, lasting-specific variance (LS)         8% 
  "Error" (TG + TS)                                  27% 
  Subtracting, temporary-specific variance (TS)       7% 

FIG. 27. Distribution of variance in the DAT Mechanical Reasoning Test. 

As Figure 27 shows, only lasting-general components count as true variance 
in the delayed between-forms correlation; therefore, 65 percent of the test 
variance is lasting and general. In the immediate between-forms correlation, 
lasting-general and temporary-general variance are both counted as nonerror. 
By subtraction, the temporary-general variance must be 85 percent less 
65 percent, or 20 percent. By further subtraction, we find that the 
temporary-specific and lasting-specific components account for 7 percent and 
8 percent, respectively. We conclude that most of the score variance is due 
to general abilities and habits, rather than to information specific to 
particular items. Quite a large proportion of the variance is due to 
characteristics which remain stable over three years. 
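
The subtractions behind Figure 27 can be sketched in a few lines of Python. 
This is only an illustration: the function name and the dictionary keys are 
invented here, and the inputs are the three correlations as they appear in 
the figure (.85 immediate parallel forms, .73 delayed retest, .65 delayed 
parallel forms). 

```python
def decompose_variance(immediate_parallel, delayed_retest, delayed_parallel):
    """Split total test variance (taken as 1.00) into four components.

    Hypothetical helper, not from the text; it only repeats the
    subtractions described for Figure 27.
    """
    # Only lasting-general variance survives both a delay and a change of items.
    lasting_general = delayed_parallel
    # Immediate parallel forms count LG + TG as true variance.
    temporary_general = immediate_parallel - delayed_parallel
    # A delayed retest counts LG + LS as true variance.
    lasting_specific = delayed_retest - delayed_parallel
    # Whatever remains of the total is temporary and specific.
    temporary_specific = 1.0 - (
        lasting_general + temporary_general + lasting_specific
    )
    components = {
        "lasting_general": lasting_general,
        "temporary_general": temporary_general,
        "lasting_specific": lasting_specific,
        "temporary_specific": temporary_specific,
    }
    # Round to two decimals, the precision of the reported correlations.
    return {name: round(value, 2) for name, value in components.items()}


dat = decompose_variance(0.85, 0.73, 0.65)
for name, share in dat.items():
    print(f"{name}: {share:.2f}")
```

With the DAT figures the four components come out to 65, 20, 8, and 7 
percent, which agrees with the percentages quoted in the text. 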

10. Prepare a diagram resembling Figure 26 for the delayed equivalent-forms 
coefficient. 
11. A teacher gives a standardized test of knowledge of scientific facts to his class 
in chemistry. Several students make scores lower than he had expected. 
a. He asks, “Could it be that I gave a form of the test which included many 
questions these particular pupils happened not to know? Would their scores 
have changed much if they had been asked other questions of the same 
type?” What reliability coefficient answers this question? 
b. He asks, “Could the performance of these students be due to the fact that 
they were having an ‘off day’? Does a pupil’s score on tests of this type 
vary much from day to day?” Which coefficient is most helpful in answering 
this question? 

12. Which types of variance are to be regarded as “error” when a questionnaire 
regarding emotional problems is used for each of the following purposes? 

a. To select high-school pupils with whom the counselor should have an early 
conference. 

b. To identify recruits likely to break down in service. 

c. To identify the area within which a pupil has conflicts, as a preliminary to 
a counseling interview. 

13. One favorite method of estimating reliability is to split a test in two parts, 
score them separately, and correlate. This correlation between half-tests is 
treated by the Spearman-Brown formula to obtain a coefficient for the full test. 
Show that this is properly considered a “coefficient of equivalence” even 
though only one form is used. 

14. What can you say about the variance make-up of a test, knowing only that a 
delayed retest gives a coefficient of .70, and an immediate parallel test gives 
a coefficient of .80? 

15. Given these facts about a test measuring “liberality” of political attitudes, 
prepare a diagram similar to Figure 27. 

Between-forms correlation at same sitting      .90 
Between-forms correlation, one year apart      .60 
Retest correlation, one year apart             .65 

16. In speaking about hearing tests for children a writer says, “Physical and 
psychological changes from day to day may make tests at two sittings less valid 
than a complete test at one sitting. We find that we get worse results on cloudy 
days than on sunny days.” 

In what sense is the word valid used? Can you defend the contrary statement 
that scores at two sittings would be more valid than a complete test at one 
sitting? 

Coefficients of Stability 

Retest coefficients might be obtained with any interval between tests from 
a few minutes to several years. If the two tests are close together, the person 
will remember some of his former answers. This carry-over makes the retest 
correlation a little higher than the correlation between two independent 
measures. 

There is no single coefficient of stability for a test; rather, there is one for 
each time interval. The longer the time between tests, the lower the 
coefficient of stability, as we shall see when we study results for the Stanford- 

TABLE 13. Illustrative Reliability Coefficients 

Test                     Sample                  Procedure                Type of      Coefficient
                                                                          Reliability

School and College       370 high-school         Kuder-Richardson         Equivalence  .93 (verbal)
Ability Tests            seniors                 formula 20                            .91 (quantitative)
                                                                                       .95 (total)

Pintner General Ability  203 12-year-old boys    Odd-even correlation     Equivalence  .86
Test, Non-Language,      in two communities
Intermediate, Form K

Short Employment Tests   About 230 candidates    Testing with parallel    Equivalence  .87 to .92 for
                         for nursing school      forms on same occasion                three sections

Short Employment Tests   72 machine operators    Retest after two years   Stability    .71 to .84 for
                         in a bank                                                     three sections

Allport-Vernon Study     48 persons (not         Correlation between      Equivalence  .49 to .84 for
of Values                otherwise described)    comparable half-tests                 six scores

Allport-Vernon Study     Not described           Retest after three       Stability    .39 to .84 for
of Values                                        months                                six scores

Binet test. For 7-year-olds, the immediate retest correlation is about .90; it 
declines steadily, so that after four years the retest correlation is only .74, 
and after eleven years it is only .68 (see p. 176). From this we can estimate 
that at least 22 percent of the test variance at age 7 results from individual 
differences which are accurately measurable at the present moment but will 
be altered by time and experience. 

The tester must interpret information on stability in the light of his 
purposes. If he intends to make long-range predictions or to measure a trait 
which is supposed to be constant, he wants stability over long periods. For 
other uses of tests, stability over a long time is of little importance. 


Coefficients of Equivalence 

The tester usually wants to know the person’s standing on some general 
quality of which the test items are representative. Very rarely are the 
qualities to be measured so specific that they must be measured by just one 
set of items. In the TMC, for example, the aim is to measure the person’s 
ability to solve virtually any mechanical problem. If scores depended very 
much on content of specific items, the test would be an unsatisfactory 
predictor except for criteria to which this specific knowledge is relevant. The 
correlation between equivalent forms is therefore likely to be the most useful 
index of test reliability. 

Where only one form of a test can be given, an internal-consistency 
procedure is used as a substitute for a between-forms coefficient. In the 
split-half method, the test is given in the usual fashion but then is scored in 
two parts. It is necessary that the two halves be independent, so that success 
on an item in one half does not help with an item in the other half. Correlating 
the two parts gives a coefficient of equivalence for the half-tests, and the 
Spearman-Brown formula is often used (with n = 2) to obtain the 
coefficient for the full test. A better estimate is obtained by the formula 

$$ r_{tt} = 2\left(1 - \frac{s_a^2 + s_b^2}{s_t^2}\right) $$ 

where $s_a$ and $s_b$ are the standard deviations of the half-tests and $s_t$ 
is the standard deviation of the total scores. The two formulas give very 
nearly the same result in most studies. The coefficient obtained is an 
estimate of the coefficient of equivalence between two full-length tests 
which are as similar as the two halves are. 
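
The two split-half estimates can be compared with a short Python sketch. It 
assumes the second formula has the form r_tt = 2(1 - (s_a^2 + s_b^2) / s_t^2), 
with s_t the standard deviation of total scores; the function names and the 
five-person data set are made up for illustration only. 

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


def spearman_brown(r_half, n=2):
    """Step a half-test correlation up to a test n times as long."""
    return n * r_half / (1 + (n - 1) * r_half)


def half_variance_reliability(half_a, half_b):
    """Estimate full-test reliability from half-test and total variances."""
    totals = [a + b for a, b in zip(half_a, half_b)]
    return 2 * (1 - (pvariance(half_a) + pvariance(half_b)) / pvariance(totals))


# Invented half-test scores for five persons (not data from the text).
odd_half = [1, 2, 3, 4, 5]
even_half = [2, 1, 4, 3, 5]

r_half = pearson_r(odd_half, even_half)
print("half-test correlation:", round(r_half, 3))
print("Spearman-Brown estimate:", round(spearman_brown(r_half), 3))
print("variance-based estimate:",
      round(half_variance_reliability(odd_half, even_half), 3))
```

In this example the two halves have equal standard deviations, so the two 
estimates coincide; they drift apart only as the half-test variances become 
unequal. 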

Two internal-consistency formulas developed by Kuder and Richardson 
are often used to obtain coefficients of equivalence for tests where one point 
is given for every correct answer and zero for a wrong answer. 
“Kuder-Richardson Formula 20” gives a coefficient for any test which is equal 
to the average of all possible split-half coefficients. The KR20 coefficient may 
generally be taken as a good approximation to an equivalent-form correlation. 
Coefficient KR20 is computed from the proportion passing each item, and from 
the standard deviation of scores (Guilford, 1956, pp. 454-455). 

The second formula, KR21, is less accurate but very simple to compute. 
This formula can be used by any tester to get quick estimates of the 
coefficient of equivalence in his group, if his test is scored by the “number 
right” formula. The quantities used in the formula are the mean (M), the 
number of items (k), and the standard deviation (s). 

$$ r_{tt} = \frac{k}{k-1}\left(1 - \frac{M(k - M)}{k s^2}\right) $$ 

For most tests this formula will give very nearly the same result as KR20, 
but sometimes it gives a much lower coefficient. When the two estimates 
differ by a large amount, the decision as to which coefficient is most relevant 
depends upon technical considerations which we cannot treat here. 
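
A minimal Python sketch of the two Kuder-Richardson coefficients follows. It 
assumes the standard textbook forms, KR20 = k/(k-1) * (1 - sum(pq)/s^2) and 
KR21 = k/(k-1) * (1 - M(k-M)/(k s^2)), with s^2 computed as the variance of 
total scores over the group tested. The function names and the small 0/1 score 
matrix are hypothetical. 

```python
from statistics import mean, pvariance


def kr20(item_scores):
    """KR20 from a list of persons, each a list of 0/1 item scores."""
    n_persons = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(person) for person in item_scores]
    # Sum of p*q over items, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in item_scores) / n_persons
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / pvariance(totals))


def kr21(k, m, s):
    """KR21 from the number of items, the mean, and the standard deviation."""
    return k / (k - 1) * (1 - m * (k - m) / (k * s ** 2))


# Invented data: five persons, four items scored right (1) or wrong (0).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(person) for person in scores]
k = len(scores[0])
m = mean(totals)
s = pvariance(totals) ** 0.5

print("KR20:", round(kr20(scores), 3))
print("KR21:", round(kr21(k, m, s), 3))
```

Because the items in this made-up example differ in difficulty, KR21 comes 
out lower than KR20, which illustrates the direction of the discrepancy 
mentioned above. 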

Internal-consistency procedures cannot be used with speeded (time-limit) 
tests, because the parts of the test are not independent. A person who 
gets stuck on one item and spends extra time on it will fail to reach the items 
at the end of the test. The correlation of items within a trial is therefore 
higher than the correlation between items separately administered, and the 
split-half or Kuder-Richardson reliability is spuriously high. The Primary 
Mental Abilities Tests for Ages 11 to 17 are given with short time limits, 
yet the manual reports only split-half reliabilities. Anastasi and Drake 
(1954) administered the half-tests with separate time limits in order to get 
a proper estimate of reliability, and compared these with reliabilities 
computed by the spurious single-administration method. The results are as 
follows for the four PMA tests: 


Verbal: separately timed halves .90; single administration [illegible] 
Reasoning: .87; .92 
Space: .75; .90 
Number: .83; .92 


It is obvious that the split-half estimates from the single administration are 
inflated and give too favorable an impression of the test. 

17. A classroom teacher gives a forty-item test with mean 34 and s.d. 3. What is 
the KR21 reliability? 

APPEAL TO THE LAYMAN 

When a patient loses faith in the medicine his doctor prescribes, it loses 
much of its power to improve his health. He may skip doses, and in the end 
he may decide doctors cannot help him and let treatment lapse altogether. 
For similar reasons, in selecting a test one must consider how worth while it 
will appear to the subject who takes it and to other laymen who will see the 
results. 

If an applicant for a job is given an employment test which he considers 
silly or unrelated to the job, he is likely to be resentful. This will make it 
difficult to obtain valid scores. If he is not hired, he may excuse his failure by 
criticizing the test; what he says to his friends damages public relations and 
makes it harder to obtain job applicants. Even the successful man may feel 
that he was hired in spite of the test, and begin work with antagonism 
toward management. Some satisfactory workers have had little schooling and 
are distrustful of tests which probe their weaknesses; catch questions and 
questions which seem childish are especially likely to arouse criticism. 

If a test is interesting and “sensible,” taking it is likely to be a pleasant 
experience. This not only tends to make the scores valid but also helps to 
establish good relations between the personnel worker and the subject. An 
Italian bus company contracted with psychological laboratories in two cities 
to give tests to applicants for jobs as drivers. After a few months, it was 
found that most of the applicants were traveling to Rome, going as much as 
100 miles farther than necessary, because the Rome center had elaborate 
testing apparatus while the second center used simple equipment to measure 
the same aptitudes. The applicants thought the elaborate tests fairer and 
more dependable. 

British experience with War Office selection boards is a second case in 
point. The selection board observes candidates during several days of field 
testing, apparatus tests, discussions, etc. Before this system was established, 
men from the ranks rarely applied for commissions because they thought the 
tests then used gave an advantage to applicants from good homes and good 
schools. They think the selection board is a fairer system where a man can 
show his true ability, and this attitude has been of great assistance in 
recruitment of officers, so much so that, although the board system is costly, 
and very possibly less valid than objective tests could be, there is no thought 
of changing it. 

The subject is not the only one who must be satisfied with the 
psychologist’s tests. The British selection program has had to satisfy a Labor 
cabinet insistent that poor boys have a fair chance to become officers, the 
parents of the boys tested, and the old-line officers who train the accepted 
men. A psychologist who installs a highly valid industrial selection program 
will find it in the ashcan a year later unless he convinces both management and 
the union that the test is fair. Users of test results have strong prejudices. If a 
group of social workers is accustomed to mental test A, the psychologist 
who decides to substitute mental test B will encounter difficulty. Even if test 
B is more accurate than A, the social worker may disregard results from B 
because it does not have his confidence. So important is user acceptability 
that the psychologist working with teachers, industrial personnel men, or 
physicians must often use a test which would be his second or third choice 
on the basis of technical qualities alone. 

A test which looks good for a particular purpose is said to have “face 
validity.” Adopting a test just because it appears reasonable is contrary to 
scientific practice; many a “good-looking” test has failed as a predictor. Civil 
service examiners, for example, prepared two tests to measure ability in 
alphabetic filing. One gave five names per item (John Meeder, James Medway, 
Thomas Matlow, Catherine Meagan, Eleanor Meehan) and asked 
which name would be third in alphabetic order. The other test required the 
subject to place a name in the proper place in a series; for example: 

Robert Carstens          A ___ 
                         Richard Carreton 
                         B ___ 
                         Roland Casstar 
                         C ___ 
                         Jack Corson 
                         D ___ 
                         Edward Cranston 

Though the makers were confident that the tests were representative of the 
same skill, and though both tests had reliabilities above .80, they correlated 
.01 (Mosier, 1947). 

Such evidence as this (reinforced by the whole history of phrenology, 
graphology, and tests of witchcraft!) is strong warning against adopting a 
test solely because it is plausible. If one must choose between a test with 
“face validity” and no technically verified validity and one with technical 
validity and no appeal to the layman, he had better choose the latter. The 
job of the tester is, after all, to get information which improves decisions. 
The tester should seek and usually can find a test which has both face 
validity and technical validity. 

18. If a clinical tester is examining a criminal to establish whether he is mentally 
responsible, he may have to present his results in court and stand 
cross-examination on them (I. Frank, 1956). In what ways might his choice of 
tests differ from those he would use in examining a similar case at the request 
of a hospital psychiatrist? 

19. A certain examination for French secondary-school admission was deliberately 
made very difficult to obtain a skewed distribution, since only a small number 
of places was to be filled. When the children told of the questions at home, 
parents organized protest meetings which ultimately brought the problem to 
the attention of the Minister of Education, who decided to give a second test 
to those who had failed. Do you agree with this decision? 


PRACTICAL CONSIDERATIONS 
Ease of Application 

In almost any field, one can choose between tests to be administered by 
untrained persons and tests which can be given only by an expert. The test 
which is simpler to apply will have more complete directions and simple 
objective scoring, and requires no observation or judgment by the tester. The 
more complex test offers more comprehensive findings, but only in the hands 
of a well-qualified tester. Attention should also be paid to the adequacy 
with which the manual assists the user in drawing conclusions from test 
results. This is especially important when a psychologist is choosing a test 
whose results many other persons will consult. 

A test manual may present all the important information about the test 
and yet fail to communicate to the reader. Lennon found, indeed, that 
large numbers of schoolteachers fail to grasp even simple factual statements 
in an achievement test manual (Lennon, 1954, pp. 90-94). The implication 
is that a person in charge of a testing program must make a major effort to 
educate all those who will give or interpret tests, rather than to rely on the 
manual to convey the insights they need. 

Equivalent or Comparable Forms 

We have noted, in connection with reliability, that equivalent forms are 
often available. Equivalent or parallel forms are tests measuring the same 
thing at the same level of difficulty, so that equal raw scores have the same 
meaning on each form. They are especially valuable when each person is 
tested twice, for instance before and after therapy. The use of new 
questions rules out the effect of memory. An equivalent form is useful also for 
checking a dubious score. A second test would be given, for example, when 
the tester suspects that emotional disturbance spoiled the first test. 

Two tests are said to be comparable when their raw scores can be 
converted to the same derived-score scale. Some school achievement tests are 
organized in a series at different levels of difficulty so that the pupil may be 
tested each year. Although the tests are not equivalent, a scale is provided 
so that performances on the easier test and the harder test can be compared 
to determine the pupil’s gain. Another type of comparability is seen in the 
DAT profile chart, which permits comparison of mechanical aptitude with 
other abilities. Such comparisons, based on the same norm group, have an 
obvious value for interpretation. 

Time Required 

Time available for testing is always limited, and therefore short tests are 
preferred, other things being equal. Too long a testing period bores the 
subject and makes him uncooperative. Where morale is high, however, one can 
give very long testing batteries successfully. We have already seen that 
reliability, and to a lesser extent validity, depends on test length. Shortening 
tests to a few items will destroy their value, but not much is gained by 
lengthening tests beyond 100 items per score except in competitive 
examinations where a few points make a great difference. The Bennett TMC has 
sixty items and requires about thirty minutes for most adolescent subjects. 
This is a usual length for a test yielding a single score. 

Multiscore vs. Single-Score Tests 

A test or battery yielding several scores has to be longer than a 
single-score test, if its scores are to be reliable. It is difficult, however, to state 
whether a multiscore test is superior to a single-score test occupying the 
same time. The single score is likely to be much more reliable than the 
several scores of the other test. The tester who needs several facts about the 
individual may prefer to obtain somewhat unreliable answers to all these 
questions rather than to measure one dimension precisely and remain 
without information on the others. A high-school counselor would obviously 
prefer ten-minute tests of five specialized abilities to a fifty-minute test which 
measures mechanical reasoning very precisely and tells nothing about the 
other four. He probably should not go so far as to substitute ten three-minute 
tests for the thirty-minute TMC, but it is hard to define a perfect balance 
between breadth of coverage and precision. 

When many decisions are to be made, each requiring a different sort of 
information, the best solution is to allow a large amount of time for gathering 
information. There are limits to this, however. In clinical diagnosis of 
children with behavior problems, for example, one could think of hundreds of 
tests and observations which might shed light on different aspects of 
development. An employer hiring an executive can likewise raise a very large 
number of questions. While no general rule can be given as to the best 
division of limited testing time, it is clear that the greatest amount of time 
should be given to the most important questions. Where there are several 
questions of about equal importance, it is definitely more profitable to use a 
brief test giving a rough answer to each one than to use a precise test which 
answers only one or two questions (Cronbach and Gleser, 1957). 

The disadvantage of quick, crude measures disappears when we make 
them a first step in a sequential measuring program. In hiring employees, 
for example, the very poor prospects can be weeded out by a rather 
inaccurate pencil-and-paper test, and sometimes those who score very well on 
such a test can be hired at once. Then only the applicants near the borderline 
need be given an accurate and more costly test. In testing an individual for 
guidance or diagnosis, we can begin with a multiscore test covering many 
variables (a battery of short aptitude tests, or an interest measure covering 
all fields). A further, more precise test can then be used for any variable 
(e.g., clerical aptitude, or interest in speaking activities) which looks 
important on the basis of the first results. 
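
The sequential hiring procedure just described can be sketched as a 
two-stage decision rule. The function name, the cutting scores, and the 
stand-in for the costly test are all hypothetical; the point is only that the 
expensive measure is given solely to borderline cases. 

```python
def sequential_decision(screen_score, reject_below, accept_at,
                        precise_test, precise_cut):
    """Two-stage selection: a crude screen first, a precise test if needed.

    `precise_test` is a callable so the costly measurement is made only
    for applicants whose screening score falls in the borderline zone.
    All cutting scores are illustrative assumptions.
    """
    if screen_score < reject_below:
        return "reject"            # very poor prospect, weeded out at once
    if screen_score >= accept_at:
        return "hire"              # very good prospect, hired at once
    # Borderline case: only now is the accurate, costly test administered.
    return "hire" if precise_test() >= precise_cut else "reject"


costly_tests_given = []


def costly_test():
    """Stand-in for administering the accurate test; returns a made-up score."""
    costly_tests_given.append(1)
    return 72


low = sequential_decision(20, 30, 80, costly_test, 70)
high = sequential_decision(85, 30, 80, costly_test, 70)
borderline = sequential_decision(55, 30, 80, costly_test, 70)

print(low, high, borderline)
print("costly tests given:", len(costly_tests_given))
```

Of the three applicants in this example, only the borderline one takes the 
costly test, which is where the saving in testing time comes from. 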

Cost 

The cost of the usual test is only a few cents, but when one is testing a 
large number of persons, a difference in cost may be worth some attention. 
Fortunately there is little relation between the cost of tests and their 
quality, so that even a limited budget permits the use of well-constructed tests. 
Cost is greatly reduced where it is possible to use an answer sheet and a 
reusable question booklet. The reusable TMC booklet costs about 18 cents per 
copy (in packages of 25). The answer sheets cost an additional 4 cents. In 
determining the cost of a test, one must consider not only the cost of the 
materials but also the cost of scoring. 

A fairly representative figure on costs of testing is suggested by the 
charges of a nonprofit test scoring and rental service operated by a state 
university for the high schools it serves (Unit on Evaluation, 1955). Costs 
include shipping, handling, materials, scoring labor, and other items. A school 
can rent the typical test booklet for 3 cents and purchase the answer sheet 
for about 3 cents. The scoring charges vary with the number of scores 
obtained for each sheet, from about 4 cents per pupil up to 15 cents. This 
means that guidance testing for sixty pupils, measuring five aptitude and 
achievement areas, would cost the school under $30, including all costs 
except teacher time for administering the tests and time used to interpret 
results to students. Since the decisions for which the tests are used make a 
great difference in the educational efficiency of the school and the soundness 
of the pupil’s life plans, concern about cost of testing should be given almost 
no weight in choice of tests. 

EVALUATING A TEST 

We have now introduced nearly every concept that is used in judging the 
adequacy of a test. Subsequent chapters will describe the various types of 
tests and apply these concepts. In that application, the concepts will be 
explained more completely. Even though those chapters need to be studied 
before attempting to draw conclusions about particular tests, we can here 
summarize the concepts and present a form useful in evaluating any test. 

The development of a testing program requires, first of all, a clear 
purpose. As we pointed out earlier, one must search for a test that fits the 
decision to be made, not just for “a good test of reading” or “a good 
personality test.” It is unrealistic for the student of testing to evaluate a test 
in the abstract, yet one cannot consider all possible applications 
simultaneously. For this reason, it is suggested that any test manual be 
approached with a definite measurement problem in mind. Our form carries a 
space for entering this purpose, which might be specific (selecting girls for 
training as punch-card clerks) or rather general (obtaining information for 
subsequent use in counseling high-school pupils as problems arise). 

Ordinarily the tester’s situation restricts the type of test that may be 
considered. It determines the choice between group and individual tests, the 
age or ability range, and the level of interpretative skill to be used. Thus a 
test to be given for later reference in high-school counseling will be a group 
test (since individual testing of a whole school is not practicable), will have 
to cover the whole range of a normal population, and will have to be suitable 
for interpretation by counselor and probably by all teachers and 
administrators. With such crude specifications in mind, one turns to 
publishers’ catalogs, the Buros Yearbooks, texts on measurement or applied 
psychology, etc., and makes a list of tests to consider. The form presented in 
Table 14 is a convenient record of facts and opinions regarding tests examined 
in detail. We have numbered the entries to make it easier to connect our 
discussion with various entries. 

The top section (entries 1-9) includes the simple descriptive facts which 


TABLE 14. A Form for Evaluating Tests 


1. Title 
2. Author 
3. Publisher 
4. Forms and groups to which applicable 
5. Practical features 
6. General type 
7. Date of publication 
8. Cost, booklet, answer sheet 
9. Time required 
10. Purpose for which evaluated 
11. Description of test, items, scoring 
12. Author’s purpose and basis for selecting items 
13. Adequacy of directions; training required to administer 
14. Mental functions or traits represented in each score 
15. Comments regarding design of test 
16. Predictive validation (criterion, number and type of cases, result) 
17. Concurrent validation (criterion, number and type of cases, result) 
18. Other empirical evidence indicating what the test measures 
19. Comments regarding validity for particular purposes 
20. Equivalence of forms or internal consistency (procedure, cases, result) 
21. Stability over time (procedure, time interval, cases, result) 
22. Norms (type of norms, cases) 
23. Comments regarding adequacy of reliability and norms for particular purpose 
24. Comments of reviewers 
25. General evaluation 
26. References 


can often be obtained from the catalog. They are for the most part 
self-explanatory. It is suggested that you enter (6) one or two words to describe 
the general type, so that completed analysis forms may be filed by 
categories. In the specimen form filled out for the Bennett TMC (Table 15), we 
have inserted simply the word aptitude. The date of publication (7) is not 
highly significant, since some older tests are excellent. An old test should 
ordinarily be scrutinized with special care, however, since some items may 
be obsolete, the norms may not be useful, and the manual is likely to be 
incomplete. One should give particular attention to the date of the last 
thorough revision of the manual. 

(This is one of the places where publishers sometimes introduce 
misleading information to make a test more appealing. It is possible to copyright 
the test manual every year so that it looks up-to-date, even though no real 
change has been made. Such embellishments will not confuse the reader 
unless he gives undue weight to superficial values. There are many such 
half-truths or misleading claims in manuals. Some can be spotted by any alert 
reader, while others are identifiable only by an expert. If the reader finds 
that the manual or test advertising is untrustworthy in one respect, he must 
of course view all the remaining information with suspicion. The scientific 
and ethical quality of the manual, however, is not always a sign of the 
quality of the test. A few excellent tests have manuals which are open to severe 
criticism for exaggerated claims.) 

The next step is to form an impression of the test by examining the items, 
the scoring principles, and the aims the author had in mind in preparing the 
test. It is this impression which largely determines the appeal of the test to 
the subject and to other laymen. In the form (11), one can describe the 
items superficially and should also list the titles of subtests to be separately 
scored. Attention should be given to the objectivity of scoring. 

The author’s stated intentions (12) help one to understand the nature of 
the test. One should be hesitant to use it for a quite different purpose, 
although this is sometimes defensible. The manual will usually indicate 
whether the author was interested in selection, guidance, clinical use, or 
classroom evaluation and will often tell what aptitudes or traits he had in 
mind in preparing items. The source of items is particularly important if the 
test is to be interpreted on the basis of content validity. 

Many test manuals report statistical studies used in selecting items. The 
most common procedure is to correlate the item score with the total score 
on the test, discarding items which do not seem to measure the same thing 
as the rest. Though this procedure is likely to improve a test (particularly 
since it eliminates ambiguous items and makes the items more similar to 
each other), it does not necessarily improve validity. Indeed, narrowing the 
range of content (in a mechanical aptitude test, for example) can lower 
validity by covering the field less thoroughly. For this reason, item-test 
correlations should never be referred to as “item-validity coefficients.” The 
consumer usually cannot evaluate the technical procedures used in test 
construction. 

Directions (13) can be examined with regard to their clarity and the ex¬ 
tent to which they standardize the test. 

An armchair analysis of the test items should be made (14) to judge what 
abilities, experiences, work habits, or personality traits influence the score. 
Such an analysis is required for each of the subscores which is to be 
interpreted. One cannot hope to identify all the contributing variables, but the 
effort raises questions to be used in interpreting validity studies and helps in 
interpreting the test. The report should state what the score seems to 
indicate, whether this is what the author intended or not, and also list irrelevant 
variables which are likely to distort scores. As a part of this analysis it is 
usually desirable to take the test oneself or to administer it to a suitable subject 
and observe his performance. 

Empirical evidence of validity (16-19) may be of various sorts. Spaces are 
provided for predictive and concurrent validation, and also for other studies 
bearing on construct validity. Some of these spaces may be irrelevant for a 
particular test or a particular purpose and if so would not be filled in. While 



TABLE 15. Evaluation Form for the TMC


1. Title. Test of Mechanical Comprehension

2. Author. George K. Bennett (with Diana Fry, Wm. A. Owens in some forms)

3. Publisher. Psych. Corp.

4. Forms and groups to which applicable.
AA: high-school boys, job applicants

AA-F: French language form of AA; questions and directions in both languages
BB: experienced workers, advanced students
CC: engineering students, high-level job applicants
W-1: high-school girls, female job applicants

5. Practical features. Can be machine-scored

6. General type. Aptitude

7. Date of publication. 1940, 1947 (AA); 1941, 1951 (BB); 1947 (W-1); 1949 (CC)

8. Cost. Booklet, 18¢; answer sheet, 4¢

9. Time required. 30 min.

10. Purpose for which evaluated. Vocational guidance of high-school pupils

11. Description of test, items, scoring. Pictures of simple apparatus. Questions in 3-choice
form (5-choice in CC) as to what will happen to an object when force is applied, which of
two structures is most stable, etc. Objective scoring (R − ½W; R − ¼W in CC). Only one overall
score obtained.

12. Author's purpose and basis for selecting items. Intended to measure an ability required
in many jobs and training courses. Past experience is allowed to affect scores, but the
items require understanding rather than rote knowledge. Items were put through various
stages of criticism and tryout; items retained were those discriminating high scorers (on
the total test or a pooled Mech. Comp. score) from low scorers.

13. Adequacy of directions, training required to administer. Directions are unusually clear
and simple. Classroom teacher can handle. Answer sheet for all forms save CC involves
awkward matching of arrows to line up with booklet.

14. Mental functions or traits represented in each score. General experience with machines
common in Western world, understanding of simple principles of motion, energy. Formal
physics helpful but not required. Solutions can be intuitive or deductive; more rigorous
deduction required in CC. Unspeeded.

15. Comments regarding design of test. Highly efficient. Use of correction formula in scoring
unnecessary but harmless. Note that no claim is made that the test measures an innate
aptitude.

16. Predictive validation (criterion, number and type of cases, result). Manual gives reference
to numerous military and industrial studies where TMC was correlated with grades in
technical training or job ratings. Coefficients range from .30 to .60.

Evidently generally useful, though the test usually must be supplemented by general
mental or verbal measures. Information is lacking on usefulness of the test for prediction
in high-school courses, or on long-range predictions from high-school testing. Such data
are available for DAT version. Form CC has validities .28 to .50 for college freshmen, with
performance in technical courses as criteria. Note that range is restricted, compared to
h.s. group.

17. Concurrent validation (criterion, number and type of cases, result). Some of the
"predictive" studies cited above may be concurrent; manual is not clear as to time separating
test and criterion.

18. Other empirical evidence indicating what the test measures. Study of 1471 applicants
for fireman-policeman jobs shows that high-school physics raises scores about [illegible] s.d. on
AA. (This information is needed, and not now available, for CC, where effect of physics
or math might be greater.)

Form AA correlates about .50 with general mental tests in wide-range groups; BB and
CC correlate .20-.30 among applicants to engineering school. Considerable overlap
with spatial tests (.50 with Minn. Paper Form Board in wide-range group, .66 with College
Board spatial test among engineering-school applicants). Correlation of .30 with tool
dexterity test. Factor-analytic studies¹¹ show a mechanical experience factor prominent in
the test; there are also substantial loadings with general mental ability, and spatial or
visualization ability.

19. Comments regarding validity for particular purposes. Test has predictive value for jobs
or courses involving nonroutine machine operation. Overlaps general and spatial tests,
so that its independent value would depend on the situation. Lack of data on predictive
power in high school limits interpretability.

20. Equivalence of forms or internal consistency (procedure, cases, result). Split-half method,
Form AA, 500 ninth-grade boys, r = .84. Similar coefficients for other forms, lower in
groups of restricted range. Interform correlation about .80 for BB vs. CC; no others
reported.

21. Stability over time (procedure, time interval, cases, result). No information presented.

22. Norms (type of norms, cases). Each manual offers several columns of percentile norms for
various groups in schools and industry. Selection of groups is poorly described (e.g., "833
ninth-grade boys," "417 applicants for unskilled jobs"). For CC, tables are given
separately for two specific engineering schools.

23. Comments regarding adequacy of reliability and norms for particular purpose.
Reliability of .80 is satisfactory but low. A second test should be given if a few points'
difference in score would alter a decision.

The norms presented have limited usefulness; the counselor would have to obtain norms
for his school, for special courses in the school, and if possible for the local job market.
Absence of information on stability prevents confident use of the test for long-range
predictions earlier than twelfth grade.

24. Comments of reviewers.

"The manuals of directions are models of conciseness and honesty. . . . There is little
doubt that the test measures comprehension of many mechanical principles, but its value
for prediction has been questioned on the ground that several items involve principles or
facts which one is unlikely to encounter in everyday mechanical experience, outside of a
physics course" (Charles M. Harsh, in Buros, 1949, p. 720).

"The Test of Mechanical Comprehension . . . should prove to be a useful tool
especially to those persons engaged in educational and vocational guidance. It should also
find increasing usage in the technical school and the industrial employment office. It is an
attractive test, the items are intrinsically interesting, all the forms appear to have been
well constructed, and they are easy to give. The range of usefulness of the test will
undoubtedly increase as more validity data are made available" (George A. Satter, in
Buros, 1949, p. 723). (Note: Considerably more evidence of validity has appeared since
the date of this comment.)

25. General evaluation. This is an exceptionally popular test, various versions having been
included in a large number of prediction batteries and having repeatedly shown value
against mechanical criteria. The concreteness of the test makes it appealing to subjects
and laymen; when used in guidance, it dramatizes the concept of special abilities.

Form AA could be used in ninth grade to explore aptitudes, and CC could be used with
seniors considering engineering or technical courses. The DAT Mechanical Reasoning Test
is a revision of the Bennett TMC which should be preferred in high-school guidance for
numerous reasons: superior norms, more substantial validity information against
high-school criteria, comparability to other tests of the DAT battery, more information on
stability.

As compared to other mechanical aptitude tests, the Bennett and DAT are less
dependent on either shop experience or dexterity. The test is a measure of understanding
and intellectual trainability; it does not guarantee proficiency without training, nor skill
in manual performance.


¹¹ The meaning of information of this sort will be considered in Chapters 9 and 10.

most users examine only the evidence presented in the test manual, one can
evaluate the test better if he considers data published elsewhere. For some
tests, the volume of research is so great that it can only be summarized or
sampled. Under heading 18 any study might be listed that helps to
establish what the score measures. One might list, for a mechanical aptitude test,
evidence of its overlap with general intelligence, of its degree of speeding,
or of the extent to which physics students earn better scores than those who
have not studied physics. Factor analyses are often relevant to this question.
Here specifically, it is necessary to select the most important information
from that available.

The final evaluation of validity (19) is the most important single entry in
the form. Does the test give the information needed to make the intended
decision? What degree of confidence can be placed in it? What level of
psychological training is required to interpret the test as proposed? To reach
such an integrated conclusion, it is necessary to weigh positive and negative
evidence, to decide which of several contradictory findings is most
trustworthy, and to judge the adequacy of the body of evidence as a whole. It is
especially important to note what necessary evidence on validity is lacking.

The next major section (20-23) considers reliability and norms. This
information is usually presented in the manual and needs only to be
summarized. The most common faults in reliability information are failure to report
subtest reliability, and application of internal-consistency formulas to
speeded tests. Norms must be examined critically for representativeness,
and for relevance to the user's own situation.
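
As an illustration of the internal-consistency procedure referred to here, the following sketch computes an odd-even split-half correlation and steps it up to full length with the Spearman-Brown formula. The helper names and the response data are hypothetical and are not taken from any manual; the sketch assumes items scored 0 or 1 and an unspeeded test.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation of two halves."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(responses):
    """Return (half-test correlation, stepped-up estimate) for 0/1 item rows."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return r_half, spearman_brown(r_half)

# Hypothetical data: 6 examinees, 6 items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(responses)
```

On a speeded test the two halves tend to agree largely because examinees stop at about the same point in both, so a coefficient computed this way overstates reliability. That is the fault the paragraph above warns against.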

It is of course important to examine whatever critical reviews are
available, and the record form includes a space (24) for quotations which
summarize the reviewer's evaluation. The general evaluation (25) is a final
summary of the advantages and limitations of the test for the particular purpose,
considering both its technical and its practical features. It is appropriate to
compare the test with others having the same general function. One can
also point to supplementary information which should be combined with the
test. Special ways of applying the test, over and above its use as a measuring
instrument, should be noted. These would include making supplementary
observations during the test, examining responses to obtain cues for
diagnosis, using test responses as a point of departure in a counseling interview,
etc.

An analysis of this sort for every test under consideration (whether the
analysis is put in writing or not) provides a basis for a total testing program.
A program is more than a list of good tests. A program will be designed so as
to minimize wasteful overlap and timed so as to get each piece of
information when it will be most helpful. Testing cannot be planned by itself. In
industry or the armed forces it must be dovetailed with recruiting, training,
and assignment. In the clinic, testing must be considered as part of the
whole therapeutic effort. The final program states what tests will be given
and when, and how the results will be used in assigning the person or in
helping him to understand himself.


Suggested Readings 

Anastasi, Anne. Test reliability. Psychological testing. New York: Macmillan, 1954.
Pp. 94-119.

This textbook chapter covers essentially the same principles and techniques
about reliability as the present chapter includes.

Rothney, John W. M., & others. Test scores: etiology and interpretation.
Measurement for guidance. New York: Harper, 1959. Pp. 116-150.

How to read a test manual and test advertising critically is discussed, with
numerous examples. These authors maintain a severely critical attitude toward
tests, demanding a closer approach to perfection than does the present text.

Wesman, Alexander G. Reliability and confidence. Test Serv. Bull., 1952, No. 44.
(Available on request from Psychological Corporation. Also reprinted in H. H.
Remmers & others (eds.), Growth, teaching, and learning. New York: Harper,
1957. Pp. 419-457.)

In a simple presentation Wesman covers the major difficulties in the
interpretation of reliability coefficients reported in test manuals.


Measurement of General Ability:
The Binet and Wechsler Scales


THE EMERGENCE OF MENTAL TESTING 
Tests Before Binet 

THE outstanding success of scientific measurement of individual differences
in behavior has been that of the general mental test. Despite the
overenthusiasm and occasional errors that have attended its development, the general
mental test stands today as the most important single contribution of
psychology to the practical guidance of human affairs. Among mental tests, none
has been more influential than that fathered by Alfred Binet. A history of
mental testing is in large part a history of the Binet test and its descendants.

The first systematic experimentation on individual differences in behavior
arose from the accidental discovery of differences in reaction time among
astronomers. In 1796, an assistant named Kinnebrook at Greenwich
Observatory was engaged in recording, with great precision, the instant when
certain stars crossed the field of the telescope. When Kinnebrook's results were
found to be consistently eight-tenths of a second later than the observations
of his superior the Astronomer Royal, he was thought incompetent in his
work and was discharged. Not until twenty years later did more careful
study show that the differences between observers were the result of the
different speeds with which they could respond to stimuli. Only gradually
did such differences come to be recognized as significant facts about human
nature, rather than as annoying errors contaminating scientific work.

Physiologists, biologists, and anthropologists were stimulated by the
scientific climate of the nineteenth century to make a great variety of
measurements of human characteristics. Notable among these early workers was
Sir Francis Galton, whose interest in differences among individuals
developed from Darwin's newly published theory of differences among species.
During the latter half of the nineteenth century, Galton invented ways of
measuring physical characteristics, keenness of the senses, and mental
imagery. These methods, though not developed fully by Galton, served as
models for later tests. In addition, Galton demonstrated that outstanding
intellectual achievement tended to occur frequently in certain families.
Genius, evidently, was not an accident or a gift of the gods, but a natural
phenomenon to be investigated scientifically.

At this time, psychology was only beginning to emerge as an objective
science. Mental processes, it was suggested, could be observed under
standard conditions by an experimenter. Scientific observations, supplementing or
even replacing philosophical speculation, could provide an exact description
of the relation between the mental and physical worlds. This was the aim
with which Wundt opened the first psychological laboratory in Leipzig, and
he and his colleagues did triumphantly establish quantitative psychological
laws comparable in form to those of physics. Believing that psychological
research should analyze behavior into its simplest elements, he designed
techniques for measuring very limited functions. Wundt, trying to establish
the general laws governing all minds, was not concerned with individual
differences. His laboratory procedures and particularly his interest in
quantitative research, however, had a strong influence on early tests. In the
United States as early as 1890, J. McKeen Cattell was using a mixture of
procedures from Wundt's and Galton's laboratories to measure sensory acuity,
strength of grip, sensitivity to pain from pressure on the forehead, and
memory for dictated consonants. Cattell was first interested in the range of
individual differences as a laboratory problem, but he quickly became excited
about the practical value of identifying superior individuals by means of
these procedures.

This line of effort unfortunately met an early debacle when it was
discovered that the new tests measuring simple elements of behavior seemed to
have no relation to significant practical affairs. The crucial study was
Wissler's work on test scores of Columbia students (Wissler, 1901). He
correlated college marks with the Cattell tests, finding such negligible correlations
as the following: reaction time, -.02; canceling a's rapidly on a printed
page, -.09; naming colors, .08; auditory memory (recall of digits), .10. We
now recognize that low correlations were certain to result, no matter what
mental functions were tested, because Wissler's brief tests were quite
unreliable, especially in his highly selected group. The disappointment which
followed the Wissler study, however, delayed attempts to base an applied
psychology on the findings of the laboratory.
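
The two influences named here, unreliable measures and a highly selected group, can be illustrated with the standard attenuation and range-restriction formulas. This is a minimal sketch: the function names are ours, and every number in it is hypothetical and chosen only to show the direction of the effect. None of the figures are Wissler's.

```python
from math import sqrt

def attenuated_r(true_r, rel_x, rel_y):
    """Correlation expected between two fallible measures.

    true_r: correlation between error-free scores
    rel_x, rel_y: reliability coefficients of the two measures
    """
    return true_r * sqrt(rel_x * rel_y)

def range_restricted_r(r, sd_ratio):
    """Correlation expected after direct selection on one variable.

    sd_ratio: s.d. in the selected group divided by s.d. in the full group.
    """
    return r * sd_ratio / sqrt(1 - r ** 2 + (r * sd_ratio) ** 2)

# Hypothetical values: a true relation of .60, a brief test with
# reliability .25, marks with reliability .64, and a selected group
# whose spread is half that of the general population.
observed = attenuated_r(0.60, 0.25, 0.64)
observed_in_selected_group = range_restricted_r(observed, 0.5)
```

Under these assumed figures a substantial underlying relation shrinks to a small coefficient before any question about the mental functions themselves arises.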

Wundt tested elements which could be precisely defined, using stimuli
which could be accurately controlled in the laboratory. The tests had
validity in the same way that a chemist's measure of the freezing point of a
substance has validity; the result describes a clearly defined characteristic
and is readily interpreted at a superficial level, no matter how much remains
to be learned about the underlying process. Tests of this sort have an
obvious content validity, and continued investigation in the laboratory spins
an ever stronger web of theory between these measures and important
constructs. Their validity for predicting practical criteria, however, has
usually been negligible (except that color vision and other sensory qualities are
important in some tasks).

For practical prediction, psychologists have relied on tests constructed on
quite another principle. Whereas laboratory tests have mostly dealt with
narrowly defined functions, most practical tests are complex worksamples.
When a complex performance is to be predicted, a sample of that very
performance will often prove to be a good predictor. To minimize effects of
specific training and to obtain a test of wide applicability, the test may
sample, not the criterion task exactly, but the general type of reasoning or
motor performance required by the criterion. The Bennett TMC is of this
nature. The Block Design test is not a sample of a real task but is an artificial
task requiring complex reasoning similar to life problems without depending
on special knowledge.

Practical testing came into psychology from medicine. Clinicians dealing
with mental defectives and pathological cases needed diagnostic tests.
Psychiatrists looked for tests which would distinguish normal from abnormal
subjects, and distinguish among various types of mental disorders. Kraepelin
and other nineteenth-century psychiatrists used reasoning problems and
tests of performance in continuous work. These tests were comparable to
requirements of life outside the laboratory. Though few of the tests of this
period survive in present-day diagnosis, clinical tests still are chiefly
concerned with complex processes. Alfred Binet, to whom we turn in a moment,
was a physician by training and he chose tests which could distinguish
between clinical groups, no matter how obscure or complex the “psychological
meaning” of the tests.

The Binet tests did have practical value for the physician, the educator,
the social worker, and, in modified form, for the employer. The practical
tests of today are much closer to worksamples of life performance than to the
psychophysical measures of Wundt. These complex tests will surely never
be replaced, but neither have they shown much recent development. Ability
tests have remained about the same since 1920, and personality tests since
1930. The practical tests of today differ from the tests of 1920 as today's
automobiles differ from those of the same period: more efficient and more
elegant, but operating on the same principles as before.

Instruments measuring relatively limited types of performance have
undergone more radical changes. Factor analysis of ability tests is leading to a
conception of abilities and their relations going far beyond that of 1920,
and numerous tests have been prepared to measure elementary
performances. The original suspicion that simple laboratory measures might have
important relations to personality is now being substantiated; for example,
whether a mental patient perceives an intermittent light as steady or
flickering is related to his diagnosis. The simpler tests are (with rare exceptions)
too inefficient for practical use, and in that sense they stand where Wissler's
experiment left them. But these tests are better rooted in psychological
theory than the complex tests which are most useful at the moment, and they
should ultimately have practical value.

The Binet Tests 

Alfred Binet, a French physician, became interested in studying
judgment, attention, and reasoning about 1890. His interest in these complex
mental processes led him to try a greater variety of tests than his
predecessors had used. In studies published between 1893 and 1911, he tried to
find out just how “bright” and “dull” children differed. Having little
preconception regarding this difference, he tried all sorts of measures: recall of
digits, suggestibility, size of cranium, moral judgment, tactile discrimination,
mental addition, graphology, even palmistry! He found, as did other
investigators, that the tests of sensory judgment and other simple functions
had little relation to general mental functioning, and he gradually identified
the essence of intelligence as “the tendency to take and maintain a definite
direction; the capacity to make adaptations for the purpose of attaining a
desired end; and the power of auto-criticism” (Terman, 1916, p. 45).

The stage was set, then, for the call in 1904 to produce the first practical
mental test. Paris school officials became concerned about their many
nonlearners and decided to remove the hopelessly feeble-minded to schools
where they could be taught a simplified curriculum. The officials could not
trust teachers to pick out the feeble-minded. They did not want to segregate
the child of good potentiality who was making no effort and the
troublemaking child the teacher wished to be rid of. Moreover, they wanted to
identify all the dull from good families whom teachers might hesitate to rate
low, and the dull with pleasant personalities who would be favored by the
teacher. Therefore they asked Binet to assist in producing a method for
distinguishing the genuinely dull. Binet's scale, which drew on his earlier
studies, was published in collaboration with Simon in 1905. In 1908 a revision
was published, and in 1911 another.

There was a great demand at this time, especially in America, for
objective methods of investigating psychological development. Although
Thorndike was using experimental tests on animals, American psychological
research had been dominated by introspection, anecdotes, and questionnaires,
all of which were as fallible as the person reporting. Binet's method, which
was to a large degree impartial and independent of the preconceptions of
the tester, was welcomed enthusiastically as a research technique and as a
means of studying subnormal children.

In 1910, Lewis M. Terman began experimentation with the Binet tests.
He produced the Stanford Revision of the Binet Scale in 1916. This revision
extended application of Binet's method to normal and superior children. The
Stanford-Binet had immediate popularity and became, rightly or wrongly,
the yardstick by which other tests were judged. Although there had been
various previous mental tests, the outstanding popularity of the Stanford
test made its conception of mental ability the standard. The acceptance of
the Stanford test was due to the care with which it had been prepared, its
success in testing complex mental activities, the easily understood “IQ” it
provided, and the important practical results which it quickly produced.
Although many criticisms have been made of the test, it was and is an
exceptionally useful instrument.

The 1916 Stanford-Binet was replaced in 1937 when Terman and Merrill
published Forms L and M of the Stanford-Binet. These tests improved on
the construction of the former edition and offered two comparable forms.
The latest revision (1960) combines the best tests of the 1937 revision into
a single Form L-M and brings the standardization up-to-date. In all parts of
the world there have been other versions taken directly from the Binet test
or one of the Terman revisions.

More Recent Trends 

Evolution of general mental tests since 1911 has taken two directions. On
the one hand, individual tests have been increasingly designed to allow
illuminating observation as a supplement to the accurate overall score. While
the Binet items reveal considerable diagnostic information, they were not
chosen for this purpose. We have already mentioned that much can be
learned about the child's personality by watching him solve mazes, and the
Binet scale includes a few maze items. Porteus, however, capitalizes on the
special value of the maze by providing a whole series of mazes of graduated
difficulty. The Kohs blocks have a similar advantage. The highest
development of tests for observation and diagnosis is found in the popular Wechsler
scales.

Whereas this trend led to more elaborate mental tests and gave great
responsibility to the observer, the other line of evolution was toward simpler
and more mechanical tests. Procedures which could be applied to large
numbers of people at once and scored routinely were first demanded for
military purposes. Several psychologists had devised experimental group
tests prior to 1918. When it became necessary in World War I to expand the
Army at an explosive rate, the Army requested psychologists to provide a
group test so that inductees who were promising could be given officer
training, those who were unfit could be rejected, and the remainder could be
appropriately classified. In one of the major achievements of practical
psychology, a group including Terman, Yerkes, and Bingham assembled a test
whose final version became famous as Army Alpha. Alpha tested ability to
follow directions, simple reasoning, arithmetic, and information. It was a
practical test, easily administered and highly useful to the Army, as Figure
28 suggests. It convinced the nation that adequate prediction of success
could be achieved through mass processing, and schools and industry
were quick to demand tests of this type after the war. Alpha in a civilian
revision and comparable group tests by Otis and others were extensively
applied.

FIG. 28. Alpha scores of Army personnel of various ranks (Yoakum and
Yerkes, 1920).

Since 1920 there have been changes in test design. For example, whereas
the early tests were highly speeded, time limits are generous in recent
American tests. The content of today's general mental tests is not, however, greatly
different from that of Army Alpha. They are more efficient and have better
norms, but they are not different in kind. The introduction of specialized
tests such as the TMC has been the most important innovation in group
testing since the 1920's. Recent research has increased the variety of things
the psychologist finds it important to measure. Though specialized tests are
being used more and more in guidance, clinical work, and educational and
industrial selection, they are nearly always supplementary to general mental
tests derived from Binet's work.

Do not assume that other lines of approach before and after Binet had no
merit, merely because they failed to attain comparable prominence. Early
workers explored many leads which appear to have been unduly neglected
(Peterson, 1925). Binet himself (following still earlier workers) made use
of inkblots to study imaginative and perceptual processes, but this technique
fell into obscurity from which it emerged only because Rorschach
independently revived the procedure twenty years later. In his monograph The
Experimental Study of Intelligence, Binet described the application of
inkblot and imagery tests to his daughters, arriving at qualitative descriptions
of the way their intelligence functioned which read as if taken from the
most modern results of projective techniques. The possibilities of improved
impressionistic procedures, which psychologists are today examining, were
neglected while Binet's psychometric strategy of summarizing all
intelligence in a single score was adopted. The accidents of time and place play a
large part in psychological history; there was, in 1905, a great practical need
for a simple and objective way of summarizing a child's general level of
mental development, but no popular demand for analysis of individual patterns
of thought.

CHARACTERISTICS OF THE STANFORD-BINET SCALE 

In the Stanford-Binet (SB) scale, as in every test to be studied, one can
trace how the investigators solved four problems which face the test
designer. First, he must decide what he intends to measure. Second, he must
invent or select items which serve that purpose. Third, he must find a
measuring unit in which to express results, since behavior rarely can be described
in countable units like inches, pounds, or light-years. Fourth, he must show
the validity of the test. Knowing how these were solved for the SB not only
reveals wherein it made a contribution but also throws light on its
limitations.

Assumptions About Intelligence 

The person making the first mental test is in the position of the hunter
going into the woods to find an animal no one has ever seen. Everyone is
sure the beast exists, for he has been raiding the poultry coops, but no one
can describe him. Since the forest contains many animals, the hunter is
going to find a variety of tracks. The only way he can decide which one to
follow is by using some preconception, however vague, about the nature of his
quarry. If he seeks a huge flat-footed creature he is more likely to bring back
that sort of carcass. If he goes in convinced that the damage was done by a
pack of small rodents, his bag will probably consist of whatever unlucky
rodents show their heads.

Binet was in just this position. He knew there must be something like
intelligence, since its everyday effects could be seen, but he could not describe
what he wished to measure, as it had never been isolated. Some workers,
then and now, have objected to this circular and tentative approach
whereby mental ability can be defined only after the test has been made.
Tests are much easier to interpret if the items conform perfectly to a
definition laid down in advance. When faculty psychology was in vogue, many
separate tests were designed for the separate mental faculties: reasoning,
memory, attention, sensory discrimination, and so on. None of these tests
used singly, however, was found to have predictive value. Terman (1916,
p. 151) explained this as follows:


    The assumption that it is easier to measure a part, or one aspect, of
    intelligence than all of it, is fallacious in that the parts are not separate
    and cannot be separated by any refinement of experiment. They
    are interwoven and intertwined. . . . Memory, for example, cannot be
    tested separately from attention, or sense discrimination separately
    from the associative processes. After vainly trying to disentangle the
    various intellective functions Binet decided to test their combined
    functional capacity without any pretense of measuring the exact
    contribution of each to the total product.


Modern diagnostic tests do obtain useful information about distinct aspects
of ability. In Binet’s time, though, one of his great contributions was
to replace the idea of separate functions with the concept of general
intelligence. Having started with the idea that some children were bright and
some dull, he found quickly that those who were best on tests of judgment
were also superior in attention, memory, vocabulary, etc. In other words, the
tests were correlated. The correlation shows that there must be some
underlying unity among these mental tests. When psychologists refer to general
mental ability, they refer to the characteristic that accounts for the
correlation among mental tests.

Binet refined his idea of intelligence by trial and error. If color matching
does not correlate with other estimates of mental ability, it must not be
influenced by the common factor. If knowing certain information correlates
with the tests of reasoning, both must measure intelligence. Out of a study
of his best test items, Binet came to his famous description quoted above.

The term intelligence test is being replaced by such terms as test of
general mental ability or test of general scholastic ability. “Intelligence”
often connotes some sort of inborn mental superiority. Performance on the
tests is influenced by many things not included in this concept of
“intelligence.” The test calls for knowledge, skills, and attitudes developed
in Western culture, and perhaps better developed in some environments than in
others. An “intelligent” person will do badly if he lacks the background the
test requires. A person is born with potentialities which may or may not be
developed. The Binet scale gives only very indirect evidence on
“potentialities”; we can observe potentiality only when it has been developed
into performance.

1. What do the following definitions of “intelligence” include that Binet’s definition
does not, and vice versa?

a. “The ability to do abstract thinking” (Terman).

b. “The power of good responses from the point of view of truth or fact” (E. L.
Thorndike).

c. “The property of so recombining our behavior patterns as to act better in
novel situations” (Wells).

2. Would the same sort of test items be called for by each of these definitions? 

3. Is previous learning included in intelligence by these definitions? By Binet's? 

Selection of Items


To test high-jumping ability, we would ask a boy to jump over standards
of various heights, beginning with easy ones and increasing the height until
we found the highest level at which he could succeed. The experimental
psychologist uses the same device in measuring weight discrimination. The test
begins with pairs of weights which are easily discriminated, and the
difference within pairs is gradually reduced until the person can no longer tell
which is heavier. The Binet scale sets up similar “hurdles.” It begins with
items the subject is expected to pass, but as the items become more
difficult, the subject begins to fail. The test is continued until we have
determined the most difficult mental hurdle he can get over.

Binet, studying bright and dull school children, realized that mental
ability increases with age. The older child is superior in taking directions,
making adaptations, and judging his own ideas. It follows, then, that a good
mental-test item should be easier for older children than for younger ones. An
item should not be used if just 25 percent of children of every age can pass
it; such an item is difficult, but it does not reflect mental development. In
selecting items for the Binet test and its revisions, preference was given to
items on which success is markedly related to age. Binet further assumed
that it was important to measure a general quality running through all
mental tasks. Therefore, a good item should correlate with the rest of the scale.

Items are located in the scale according to their difficulty for children at
each age. A test which about 60 percent of 13-year-old children can pass is
placed at Year XIII.
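
The two selection rules just described (success must rise with age, and an item is placed at the age where about 60 percent pass) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the pass rates are hypothetical, and the actual placement of Stanford-Binet items also rested on correlation with the rest of the scale, which is not modeled here.

```python
from typing import Dict, Optional


def rises_with_age(pass_rate_by_age: Dict[int, float]) -> bool:
    """True if the proportion passing increases at every successive age."""
    rates = [pass_rate_by_age[age] for age in sorted(pass_rate_by_age)]
    return all(later > earlier for earlier, later in zip(rates, rates[1:]))


def place_item(pass_rate_by_age: Dict[int, float],
               criterion: float = 0.60) -> Optional[int]:
    """Return the youngest age at which the pass rate reaches the criterion.

    The 0.60 default follows the "about 60 percent" figure in the text;
    returns None if no age group reaches it.
    """
    for age in sorted(pass_rate_by_age):
        if pass_rate_by_age[age] >= criterion:
            return age
    return None


# Hypothetical pass rates, invented for illustration only.
age_graded_item = {11: 0.30, 12: 0.45, 13: 0.62, 14: 0.78}
flat_item = {11: 0.25, 12: 0.25, 13: 0.25, 14: 0.25}

print(rises_with_age(age_graded_item), place_item(age_graded_item))
print(rises_with_age(flat_item), place_item(flat_item))
```

The second item is hard (only a quarter pass) but is equally hard at every age, so it fails the first rule and would not be used.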

4. When a Japanese investigator prepares a counterpart of the SB for Japanese 
children, would a direct translation of the scale be satisfactory? 



Administration of the Scale


The child is given the SB by an experienced examiner, who presents each
item in the precise manner called for by the directions. The examiner
begins by establishing rapport, aided by the high interest the “games” have
for the younger children and the challenge of the test situation for the
older child. The first items tried are those for a mental level below that
expected of the child; beginning with easy tasks builds confidence. First,
the basal age, the scale level at which the child passes all the tests, is
located. Tests for the higher levels are then given in order, usually six
tests at each level. Testing continues until the child fails all tests at
a given level.

TABLE 16. Representative Tasks from the Stanford-Binet Scale

                                                    Correlation
                                                    with Whole   Nature of  Nature of
Year     Task                                       Test         Stimulus   Response

II-6     Points to toy object "we drink out of"     .55          Verbal     Motor
         Shows doll's hair                          .63          Verbal     Motor
         Names chair, key                           .70          Object     Verbal
         Repeats "4-7"                              .63          Verbal     Verbal, memory
IV       Names gun, umbrella                        .79          Picture    Verbal
         Recalls name of object (dog) when          .53          Object     Verbal, memory
           covered by box
         "Brother is a boy, sister is a ___"        .56          Verbal     Verbal
         Matches circles, squares                   .75          Picture    Motor
         "Why do we have houses?"                   .70          Verbal     Verbal
VI       Defines orange, envelope                   .67          Verbal     Verbal
         Gives examiner 9 blocks                    .77          Verbal     Motor
         Maze                                       .69          Object     Motor
         "An inch is short, a mile is ___"          .67          Verbal     Verbal
IX       Examiner notches folded paper, child       .62          Object     Drawing
           draws how it will look unfolded
         Verbal absurdities                         .83          Verbal     Verbal
         Reproduces design from memory              .60          Picture    Drawing, memory
         Repeats "8-5-2-6" backward                 .52          Verbal     Verbal, memory
         Figures change from a purchase             .62          Verbal     Calculation
XII      Defines skill, juggler                     .79          Verbal     Verbal
         Finds absurdity in picture                 .51          Picture    Verbal
         Defines constant, courage                  .85          Verbal     Verbal
         Completes "The streams are dry . . .       .72          Verbal     Verbal
           there has been little rain"
Average  Defines regard, disproportionate           .86          Verbal     Verbal
Adult    Explains how to measure 2 pints of         .70          Verbal     Verbal
           water with a 5-pint and a 3-pint can
         Explains a proverb                         .73          Verbal     Verbal
         Compares laziness and idleness             .80          Verbal     Verbal

In Form L-M, tests cover levels of mental development from age 2 to
Superior Adult III. From ages 2 to 5, there are six tests (plus one alternate)
at each half-year of development. Above age 5, hurdles are spaced one year
apart, and above age 14, the levels have even wider spacing. No child takes
the entire set of tests. A 9-year-old would begin with tests at Year VIII and,
if he passed those, would continue until he reached his limit of ability. Some
9-year-olds would be unable to go beyond the 11-year level, whereas others
would still be passing a few tests at the 14-year level. One hour, more or less,
is required for the testing procedure, although there is great variation from
child to child.

Administering the scale requires skill. The tester must exercise
considerable judgment in obtaining from each child as clear answers as possible,
without probing more than the standardized directions allow. Rapport is
especially a problem with younger children, who are not accustomed to tests or
to tasks calling for sustained attention.


The [...] task with a strange examiner may be far below his potential. McHugh
(1943) gave the SB to pupils entering kindergarten and then retested them
after two months. Their mental-age scores increased nearly six months during
this period and their IQs by 6 points, on the average. Tasks requiring oral
response showed twice as much change as tasks calling for manipulative
responses. McHugh suggests that shyness in a new situation accounts for most
of the difference between the first and second test.

The young child often refuses to try items he should be able to solve, as in
this set of examiner’s notes on a boy in kindergarten (Mayer, 1935, p. 325):


Always smiled and gave a sort of laugh when refusing to respond,
but was none the less determined. Same in school, according to teacher.
Refused to do things, but always smilingly and pleasantly, but will not
yield. School doctor was unable to give him physical examination
“because he refuses to open his mouth or do anything asked.” On Pictures
he said, “no,” and politely but conclusively turned the page. “I won’t tell
you” was his affable answer for Comprehension, Materials, Opposite
Analogies. “No, I don’t want to” disposed of all repetitions. He pushed
away Buttoning, refused to attempt the Knot, didn’t want to draw a
triangle, but was prevailed upon to try. Even with the privilege of
delivering a note to the teacher offered as a bribe, he would not complete the
bird. “No,” he said, “I’ll make a pig.” He didn’t want to fold a square,
but in marked contrast was his alacrity in responding to Paper Folding:
Triangle. “I’m going to make one, too,” and took the paper to start
before the examiner could give it to him.


Rust (1931) found that a child often passes an item he has refused if it is
presented again on another day. If credit were allowed for such
passes-after-refusal, one-quarter of the 3-year-olds tested would raise their
IQs by 15 points or more. The Merrill-Palmer scale for preschool children
contains a correction procedure to take refusals into account, but the SB
does not.

Only those persons should give the SB who have been trained in its use
and scoring. Terman suggests that an adequate training program calls for a
general course in mental-test theory, a practicum course during which the
student tests at least 25 subjects, and further experience in clinical courses
where he gives the test to various kinds of subjects. Training may be
considered complete when the person has tested about 100 cases under
supervision, beyond the 25 practice subjects.

The scale includes a great variety of tasks, as can be seen from Table 16.
Verbal Absurdities (Form L, Year IX; Terman and Merrill, 1937) is a rather
typical test.¹

Procedure. Read each statement and, after each one, ask “What is foolish about
that?” If the response is ambiguous, say, “Why is it (that) foolish?”

(a) Bill Jones’ feet are so big that he has to pull his trousers on over his head.

(b) A man called one day at the post office and asked if there was a letter
waiting for him. “What is your name?” asked the postmaster. “Why,” said the man,
“you will find my name on the envelope.”

(c) The fireman hurried to the burning house, got his fire hose ready, and after
smoking a cigar, put out the fire.

(d) In an old graveyard in Spain they have discovered a small skull which
they believe to be that of Christopher Columbus when he was about ten years old.

(e) One day we saw several icebergs that had been entirely melted by the
warmth of the Gulf Stream.

The child passes this at Year IX and receives two months’ credit on his
mental-age score if three responses are satisfactory. Four correct allows two
more months’ credit and counts as a pass at Year XII.

The various subtests call for verbal and nonverbal performances, simple
memory and complex reasoning, learned answers to familiar questions, and
solution of novel problems calling for ability to adapt. Tasks involving
objects and pictures are used at younger ages, with an increasing reliance on
verbal problems through the school ages, and more tests of abstract thinking
at the upper end of the scale.

Scoring is made as objective as possible by means of a scoring guide which
contains specimen acceptable and unacceptable answers. In an absurdities
item, the subject is expected to recognize clearly the central absurdity,
and not to bring in irrelevant matters.

5. Judge the following answers to the problem “Bill Jones’ feet are so big . . .”
(from Pintner et al., 1944, p. 60) as right or wrong:

a. You can't put them on because his legs are joined together.

b. You can’t put your trousers over your head because your legs are in them.

c. He's supposed to put the trousers over his feet.

d. A man couldn't put his pants over his head.



Scoring System


Binet’s plan of successive hurdles makes it possible to report mental
development in a simple and easily comprehended score called the mental age.
The subject's mental age is the chronological age at which the average child
does as well as the subject does. John, who is only 5, can earn a mental age of
8 if he does as well as the average 8-year-old. Scoring would be simple if
the child passed all tests to a certain level and failed all tests after that level.
Because the failures enter gradually, the mental age is determined by
adding credits (usually two months per test) for each test passed. Where test
levels are two years apart, and six tests compose each level, each test counts
four months; where levels are six months apart, each test counts one month.
The total credit in months is converted into a mental age in years and
months (written thus: 5-8 for 5 years, 8 months).

¹ Copyright 1937, 1960, Houghton Mifflin Co., and used by permission.

Table 17 reports the performance of six children. The first one, Frank,
shows uniform performance; when he begins to fail, he fails nearly all tests
at that level. His basal level is VI, so he receives a base credit of 6 years of
mental age. Four tests passed at Year VII add 8 months’ credit; one at VIII
adds 2 months. His mental age (MA) is 6 years, 10 months.

TABLE 17. Stanford-Binet Performance of Six Children

                 Child     Frank    Billy    Herbert  May      Bruce    Nancy
                 Age       6-4      6-2      8-0      5-3      8-6      10-3

       Number    Credit
       of        per Test
Year   Tests     (Months)  Number of Tests Passed by Child at Each Level

V      6         2         6        --       --       6        --       --
VI     6         2         6        6        6        6        6        --
VII    6         2         4        4        4        5        3        --
VIII   6         2         1        4        6        2        4        --
IX     6         2         0        2        3        1        1        6
X      6         2                  3        1        3        0        4
XI     6         2                  0        1        0                 4
XII    6         2                           0                          3
XIII   6         2                                                      1

MA                         6-10     8-2

Billy has greater “scatter” of successes and failures. His MA is figured
as follows:


Basal age   VI                                6 yrs.
            VII     4 tests, 2 mo. each                8 mo.
            VIII    4 tests, 2 mo. each                8 mo.
            IX      2 tests, 2 mo. each                4 mo.
            X       3 tests, 2 mo. each                6 mo.

Total                                         6 yrs., 26 mo.
                                            = 8 yrs., 2 mo. = MA
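
The bookkeeping above is easy to mechanize. The sketch below is a minimal illustration, not the published scoring procedure: the function name and the data layout (a list of tests-passed and credit-per-test pairs for the levels above the basal age) are assumptions made for the example. The two cases shown are Frank and Billy, whose mental ages are worked out in the text.

```python
from typing import List, Tuple


def mental_age(basal_years: int,
               levels_above_basal: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return mental age as (years, months).

    basal_years: the basal age, credited in full.
    levels_above_basal: one (tests_passed, credit_months_per_test) pair
        for each level above the basal age.
    """
    credit_months = sum(passed * credit for passed, credit in levels_above_basal)
    total_months = basal_years * 12 + credit_months
    return divmod(total_months, 12)


# Frank: basal VI, then 4 passes at VII, 1 at VIII, 0 at IX (2 months each).
frank = mental_age(6, [(4, 2), (1, 2), (0, 2)])

# Billy: basal VI, then 4, 4, 2, 3, 0 passes at VII through XI.
billy = mental_age(6, [(4, 2), (4, 2), (2, 2), (3, 2), (0, 2)])

print("Frank: %d-%d" % frank)   # agrees with the 6-10 given in Table 17
print("Billy: %d-%d" % billy)   # agrees with the 8-2 given in Table 17
```

Because each level carries its own credit value, the same function handles the upper levels of the scale, where a test counts for more than two months.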


Mental age measures the child’s performance; it is in effect the raw score
on the test. Obviously he is a bright child if his MA is greater than his life
age. Two children of the same MA have the same [...], but they may differ in
pattern of development. Young superior children pass different tests than
older subnormal children (Magaret and Thompson, 1950).

The MA is an estimate of present performance and of promise in the
immediate future. In any classroom, young superior children are more nearly
equal in performance to average children than to bright children of normal
age. In making decisions within a group of varied age (e.g., in sectioning of
classes) the mental age rather than the IQ gives the most relevant
information. In research also, if it is desired to equate groups, to separate
groups of unequal ability, or to correlate some other variable with mental
ability, the mental age should be used rather than the IQ. This principle is
often violated. The correlation of IQ with another variable is lower than that
of MA with the same variable in a group of mixed age.

Mental growth is slow after age 15 or 16. The mental-age units employed
for higher ages are not directly related to the average performance at these
ages and therefore should be considered only as another form of raw score.
The average 20-year-old has a mental age well below 20.

6. Compute the mental ages for the remaining four children in Table 17.

7. A 20-year-old passes the following tests: XIV, all; Average Adult, 7 tests out
of 8, credit 2 months each; Superior Adult I, 2 tests out of 6, credit 4 months
each; Superior Adult II, 1 test out of 6, credit 5 months each. Find his MA.


THE INTELLIGENCE QUOTIENT


The “intelligence quotient” in the 1960 Stanford-Binet and nearly every
other current test is nothing more than a standard score. Instead of the
common scale with a mean of 50 and a standard deviation of 10, the IQ
conversion fixes the mean at 100 and the standard deviation at 16. Since the IQ
distribution is nearly normal, the IQ can be interpreted as an indication of
the child’s position in the group. The mental-age score is converted into an
IQ by referring to tables in the Stanford-Binet manual. Tables are
provided for ages from 2 to 18. For adults, the 18-year-old norms can be used,
although, as will be seen later, the average mental-test score is not strictly
constant throughout maturity.


This IQ is not really a quotient at all, and if it were not for its long tradition
there would be considerable advantage in employing a standard-score scale
with mean 50, as in other ability tests. The IQ was originally introduced as a
ratio or quotient representing the child’s rate of mental development.
Mental age was divided by actual age and multiplied by 100 to remove the
decimal fraction. For Frank, whose age is 6 years and 4 months (6.33) and
whose mental age is 6-10 (6.83), the ratio IQ is 108. Development more
rapid than the average is indicated by a quotient over 100.
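
The ratio IQ just defined can be written out as a short Python sketch. The function names are illustrative assumptions; ages are given as (years, months) pairs and converted to months so that no rounding enters before the final step.

```python
from typing import Tuple


def to_months(age: Tuple[int, int]) -> int:
    """Convert a (years, months) pair to total months."""
    years, months = age
    return years * 12 + months


def ratio_iq(mental_age: Tuple[int, int],
             chronological_age: Tuple[int, int]) -> int:
    """Ratio IQ: mental age divided by actual age, multiplied by 100."""
    return round(100 * to_months(mental_age) / to_months(chronological_age))


# Frank: mental age 6-10, chronological age 6-4.
frank_iq = ratio_iq((6, 10), (6, 4))
print(frank_iq)  # the text reports 108 for Frank
```

A child whose mental age equals his chronological age obtains exactly 100 by this formula.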
The calculation of ratio IQs fell into disrepute for several reasons. The
quotient was originally thought of as representing a fixed rate of
development which could not only express relative brightness today but also predict
mental age at subsequent ages. Evidence to be discussed below indicates
that the child's rate of mental development is not fixed. Moreover, the ratio
depends upon technical characteristics of the scale as well as upon his
mental growth. In the 1937 SB, the standard deviation of IQs was below 16 at
ages 5 and 6, and much larger at ages 3 and 12. These variations resulted,
not from changing rates of development, but from the distribution of item
difficulties at the various ages. A third objection was that the ratio IQ could
not be applied to persons beyond age 13, where mental-age units become
arbitrary. Special corrections were introduced to obtain IQs on Forms L and
M for older subjects. For the 1960 revision, the investigators calculated the
standard deviation of mental age for a representative sample of persons at
each age. Whatever MA fell one standard deviation above the mean for
that age was converted into an IQ of 116. A standard-score IQ formed in this
manner is often called a “deviation IQ.”
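
The standard-score conversion described above can be sketched as follows. This is a hedged illustration, not the Stanford-Binet procedure itself: the published scale uses tables in the manual, and the mean and standard deviation of mental age used below are hypothetical values invented for the example. Only the target scale (mean 100, standard deviation 16) comes from the text.

```python
def deviation_iq(ma_months: float,
                 mean_ma_months: float,
                 sd_ma_months: float,
                 scale_mean: float = 100.0,
                 scale_sd: float = 16.0) -> int:
    """Convert a mental age to a standard-score ("deviation") IQ.

    The mental age is expressed as a distance from the age-group mean in
    standard-deviation units, then placed on a scale with the chosen mean
    and standard deviation.
    """
    z = (ma_months - mean_ma_months) / sd_ma_months
    return round(scale_mean + scale_sd * z)


# Hypothetical norms for one age group: mean MA 120 months, SD 20 months.
print(deviation_iq(140, 120, 20))  # one SD above the mean
print(deviation_iq(120, 120, 20))  # at the mean
print(deviation_iq(100, 120, 20))  # one SD below the mean

# The same function gives the "common scale" with mean 50 and SD 10.
print(deviation_iq(140, 120, 20, scale_mean=50, scale_sd=10))
```

Because the conversion depends only on position within the age group, a given IQ means the same relative standing at every age, which the ratio IQ could not guarantee.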

During the interim while the 1960 revision is replacing the 1937 revision,
there will be some confusion because IQs on the two scales are not strictly
comparable. A 12-year-old who has a ratio IQ of 138 on the older scale
would have a deviation IQ of 132. These differences probably do not distort
greatly the mean IQs in typical groups or the correlations of IQ with other
variables. Research results from Forms L and M may be used in interpreting
Form L-M.

8. Compute ratio IQs for the remaining children in Table 17.

9. If for age 15 the standard deviation of mental ages is 2 years and 10 months,
find the deviation IQ corresponding to an MA of 16.

Distribution of IQs 

The distribution of IQs in the 1937 standardization sample is shown in
Table 18. The change to deviation IQs is not expected to alter this
distribution greatly. The other two columns give comparative data for high-school
and college samples.

While comparison of any person or group with the total national
population is of some value, practical decisions require us to estimate how a person
will fit into a more selected group. Even when the child enters school, his
companions are not representative of the total population, for some
subnormal children are institutionalized or cared for in the home. Community
and neighborhood differences also restrict the range of any class. Through
the grades there is slow but continuous elimination, especially where
children are permitted to leave school to work. The superior child is less likely
to leave school than the child who is frustrated in schoolwork. The end result
is a gradual rise in the average level. A study of a representative sample of
dropouts in five school systems, made in the 1940’s, permits us to construct
Table 19. The very dull tend to drop out as soon as they reach age 16; since
they are usually retarded one or two grades, they leave before the ninth
grade. By the end of high school, almost no one with IQ below 85 is still in
school. A few of the superior pupils drop out because of lack of interest,
financial problems, and other difficulties. Because tests other than the Binet
were used, these IQs are not precisely comparable to Binet IQs. The IQ
range in high school is obviously unlike the representative sample studied
by Terman and Merrill, and college groups are even more selected, as
Table 18 shows.

TABLE 18. Percentage Distribution of IQs

                 Standardizing     High-School     College
                 Sample            Graduates       Entrants
IQ               (N = 2,904)       (N = 21,597)    (N = 1,093)

140 and above     1.3 }
130-139           3.1 } 12.6         9.7            31.7
120-129           8.2 }
110-119          18.1               22.8            46.1
100-109          23.5               29.9            18.1
90-99            23.0               23.2             4.0
80-89            14.5 }
70-79             5.6 } 22.7        14.3              .1
Below 70          2.6 }

NOTE: The standardizing sample data are for ratio IQs on the 1937
Stanford-Binet (Terman and Merrill, 1959). High-school data are for group tests as
recorded in school files (Semens et al., 1956). College data are for
Wechsler-Bellevue and WAIS administered to freshmen at San Jose State College
(Plant, 1958).

TABLE 19. Educational Records of 2500 Seventh-Graders

                                            Intelligence Quotient
                                   Below
                                   85      85-94   95-104  105-114  115+

All cases in Grade 7               400     575     650     575      400
Dropouts in Grades 7 and 8          93      30      14       5        2
Remainder entering Grade 9         307     545     636     570      398
Dropouts in Grades 9 and 10        241     171     143      78       29
Remainder entering Grade 11         66     374     493     492      369
Dropouts in Grades 11 and 12        52      65      81      55       25
Remainder continuing to
  graduation                        14     309     412     437      344

SOURCE: Dillon, 1949.
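
The selective elimination recorded in Table 19 is easier to see as percentages. The short sketch below takes the first and last rows of that table as they appear in the source and computes, for each IQ band, the percentage of seventh-graders who continued to graduation. The variable names are illustrative; no figures beyond those in the table are assumed.

```python
# Counts from Table 19 (Dillon, 1949): pupils in Grade 7 and pupils
# continuing to graduation, by IQ band.
grade7 = {"Below 85": 400, "85-94": 575, "95-104": 650,
          "105-114": 575, "115+": 400}
graduates = {"Below 85": 14, "85-94": 309, "95-104": 412,
             "105-114": 437, "115+": 344}

# Percentage of each band that stayed in school through graduation.
percent_graduating = {
    band: round(100 * graduates[band] / grade7[band], 1)
    for band in grade7
}

for band, pct in percent_graduating.items():
    print(f"{band:>8}: {pct:5.1f} percent graduated")
```

The percentages rise steadily from the lowest band to the highest, which is the gradual rise in average level that the text describes.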

Since the range of abilities varies from school to school and from class to
class, a final judgment of the pupil’s standing must be based on local norms.
Local norms change from time to time owing to population migration and
changing school policies. At the college level, Wolfle’s report (1954, p. 147)
provides an important warning against overgeneralizing in interpreting an
IQ. He studied 41 representative colleges with the AGCT test, whose scale
is roughly comparable to an IQ scale. In the highest of the 41 colleges, the
middle 50 percent of the entering freshmen fell between 126 and 137; in
the lowest, the middle range was from 99 to 117. Clearly, a student who
would succeed readily in one college might be far below his competitors in
another. Information about the competition to be expected is necessary in
guiding potential college students, both to insure that they will consider
colleges where they have a chance to be accepted and to increase their
chance of survival in the college they choose. To assist counselors, many
colleges now publish leaflets describing the ability distribution in their entering
classes, usually based on the Scholastic Aptitude Test of the College
Entrance Examination Board.


Meaning of Particular IQs 

Some writers translate IQ levels into labels such as “normal,” “near
genius,” “feeble-minded,” etc. This is misleading, because there is no
borderline at which genius, for example, suddenly appears. Some persons of IQ
110 make significant original contributions, and some of IQ 160 lead
undistinguished adult lives. Some adults of IQ 80 are incapable of adjustment
to the world, and some of IQ 60 support themselves and make an adequate
home.

A classification of mental deficiency provides a starting point for thinking
about the individual case. Persons with IQs from 40 to 59, for example, may
be labeled morons (Bernreuter and Carr, 1938), but while these categories
are convenient, it is wrong to think of them as pigeonholes. A quantitative
standard might seem to be the most just procedure for determining
admission to an institution for the mentally deficient, but this policy, tried in
the earlier days of testing, led to some ludicrous results. The distinguished
Czech statesman Jan Masaryk, during a childhood stay in America, was
confined briefly in an institution which had such a policy, no doubt because
having to use a strange language pulled down his Binet IQ (Porteus, 1950,
p. 40). Clinical disposition of a case is always to be based on a combination
of mental-test data with evidence on the person’s functioning in social and
practical situations.

The human meaning of the high IQ is shown in the research by Catherine
Cox Miles, who estimated the IQs of famous persons from their childhood
histories. “Voltaire wrote verses from his cradle; Coleridge at 3 could read a
chapter from the Bible; Mozart composed a minuet at 5; Goethe, at 8,
produced literary work of adult superiority” (Cox, 1926, p. 217). The minimum
IQs which could account for the recorded facts about these men were
estimated as: Voltaire, 180; Coleridge, 175; Mozart, 160; Goethe, 190. The true
IQs might conceivably have been higher, but full evidence was not available.

Terman and Oden (1947) followed children with high IQs into adulthood.
Considered as a group, these young adults were found to be in every way
superior to average men and women of comparable age. Here are a few of
the facts about their careers: 90 percent entered college and 70 percent
graduated. At an average age of 40, the 800 men had published 67 books,
over 1400 scientific and professional articles, and over 200 short stories and
plays. They had more than 150 patents to their credit. As Terman says
(1954), “nearly all the statistics of this group are from 10 to 30 times as large
as would be expected for 800 men representative of the general population.”
(See also Terman and Oden, 1959.)

The meaning of the IQ is best understood by one who has observed many
children of known IQ in particular situations. A partial substitute for such a
background may be gathered from the research literature, where various
writers have established the IQ requirements of particular tasks. The general
trend of these results is indicated in Table 20 and Figure 28. These
standards are not dependable guides for decisions in specific situations, but they
are nevertheless worth study.

TABLE 20. Expectancies at Various Levels of Mental Ability

IQ

130    Mean of persons receiving Ph.D.
120    Mean of college graduates
115    Mean of freshmen in typical four-year college
       Mean of children from white-collar and skilled-labor homes
110    Mean of high-school graduates
       Has 50-50 chance of graduating from college
105    About 50-50 chance of passing in academic high-school curriculum
100    Average for total population
 90    Mean of children from low-income city homes or rural homes
       Adult can perform jobs requiring some judgment (operate sewing
         machine, assemble parts)
 75    About 50-50 chance of reaching high school
       Adult can keep small store, perform in orchestra
 60    Adult can repair furniture, harvest vegetables, assist electrician
 50    Adult can do simple carpentry, domestic work
 40    Adult can mow lawns, do simple laundry

SOURCES: Wolfle, [...]; Beckham, 1930; [...]; and others.

10. Why is the mean of college freshmen IQs higher than the level where there is a
50-50 chance of succeeding in college? Is this a desirable situation?

11. If the academic curriculum requires an IQ of 105, what does this imply
regarding educational planning for below-average youth?

[Figure: scatter diagram; both axes marked in IQ class intervals from 40-44 to 145-149]

FIG. 29. IQs obtained by 7-year-olds when tested on two forms of the Stanford-Binet
(Terman and Merrill, 1937, p. 45).


Error of Measurement 

A coefficient of equivalence, telling how much the IQ is affected by
short-term errors of measurement, has been obtained by administering Forms L
and M a few days apart. The correlation is about .91 for unselected cases
(Terman and Merrill, 1937, p. 47). This establishes the SB as one of the most
reliable of all tests. Even so, the average shift of IQ from one measurement
to another is substantial: 5.9 for IQ 130, 5.1 for IQ 100, 2.5 for IQs below 70.
This means that an IQ of 130 may be 12 to 14 points from an estimate made
for the same child a few days later, although such errors are infrequent. The
best way to visualize the error of measurement is to study the scatter diagram
for 7-year-olds, reproduced in Figure 29. We see that the test is more
precise for low IQs; changes below IQ 80 are slight. Notice also the occasional
large shifts, despite the general agreement between the two measures. For
those with IQ 95-99 on Form L, the Form M estimates range from about 87
to about 112.

12. What range of Form M IQs is found among children earning 130-134 on
Form L?

13. There are seventeen cases having Form L IQs of 125 and above. Their median 
IQ on Form L is about 134. How many of them earned a higher score on 
Form M? How many shifted to a lower class-interval? 

14. Would the interpretation for any child be changed if his Form M IQ were used 
instead of his Form L score? 

15. What is the largest change of IQ in the chart? 
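
As a rough check on the size of the shifts reported under Error of Measurement, the standard formulas of classical test theory can be applied to the figures given there (correlation about .91 between forms, standard deviation 16). This sketch rests on assumptions not made in the text: that errors are normally distributed and equal in size at every IQ level. The text itself shows the second assumption is only approximate, since shifts are smaller for low IQs. The function names are illustrative.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical-test-theory SEM: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def expected_absolute_shift(sd: float, reliability: float) -> float:
    """Mean absolute difference between two parallel forms.

    The difference between two forms has standard deviation
    sd * sqrt(2 * (1 - reliability)); for a normal distribution the mean
    absolute value is that standard deviation times sqrt(2 / pi).
    """
    sd_of_difference = sd * math.sqrt(2.0 * (1.0 - reliability))
    return sd_of_difference * math.sqrt(2.0 / math.pi)


sem = standard_error_of_measurement(16, 0.91)
shift = expected_absolute_shift(16, 0.91)
print(f"SEM: {sem:.1f} IQ points")
print(f"Expected average shift between forms: {shift:.1f} IQ points")
```

The expected shift of a little over 5 points is in the same neighborhood as the average shifts quoted in the text for the middle of the IQ range.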


Stability 

The stability of mental performance has direct practical importance, since
we cannot make long-range educational and vocational plans if ability
changes greatly. Evidence on stability is also of great theoretical importance,
since it throws light on the nature of intelligent performance, and on the
extent to which performance is predetermined by heredity and by events
early in life.

Scores on the lower levels of the SB are much poorer predictors of later
IQ than are scores during the school years. One reason for inconsistency
between early and later tests is that the nature of the test items changes, and
therefore different abilities are measured. Environmental influences during
early years may also develop abilities not shown in early test performance,
or may retard those which did show. Bayley (1949) retested children
repeatedly from age 1 month to age 18 years. Although her results are based in
part on special tests for infants and young children which we have not yet
described (see pp. 208ff.), the findings apply to any present mental tests.
Table 21 gives correlations between earlier and later measures. In these
results it is clearly seen that the later a test is given, the more stable the IQ is.
Tests before age 2 are unstable even over short periods. Scores show a
marked increase in long-range predictive power near age 6.

TABLE 21. Correlation of Mental Test with Test at a Later Age

Approximate                              Years Elapsed Between
Age at                                   First and Second Test
First Test   Name of First Test          1           3          6       12

3 months     California First-Year       .10 (CFY)   .05 (CP)   -.13    .02
1 year       California First-Year       .47 (CP)    .23        .13     .00
2 years      California Preschool        .74 (CP)    .55        .50     .42
3 years      California Preschool        .64         --         .55     .33
4 years      Stanford-Binet              --          .71        .73     .70
6 years      Stanford-Binet              .86         .84        .81     .77 (W)
7 years      Stanford-Binet              .88         .87        .73     .80 (W)
9 years      Stanford-Binet              .88         .82        .87     --
11 years     Stanford-Binet              .93         .93        .92     --

SOURCE: Bayley, 1949. Some entries have been estimated from closely related
data in Bayley’s report. Initials indicate second test; W stands for
Wechsler-Bellevue. Where no initial is given the Stanford-Binet is the second test.

Figure 30 charts the change in scores found upon retesting of
behavior-problem children with the 1916 revision. The average time between testings
was 15 months, and the age at first test was generally between 7 and 14.
A. W. Brown (1930) comments on these data as follows:

Although the correlation . . . is high, a large number of cases make
considerable change, and from the clinical point of view these are often
the important cases. One hundred eight cases or 15.2 percent change
eleven points or more. To say that the average change is about five
points does not help a great deal, because in dealing with clinical cases
one can never be sure that the particular case under observation may
not be one that will show a large amount of change. It would seem
advisable therefore to secure at least two ratings wherever an intelligence
rating is especially important in disposing of the case or in making
recommendations.


FIG. 30. Changes in IQ when 1916 Stanford-Binet is repeated after an
average interval of fifteen months (A. W. Brown, 1930). Axis: change of IQ
from first to second test, from decrease to increase.

Another study of similar children over an even longer time (R. Brown,
1933) found that 3 percent changed more than 30 IQ points and 10 percent
changed 21 to 30 points. It is unsound practice to rely on mental tests given
several years previously. Extreme reversals, from IQ 70 to IQ 120, are rare,
but some highly important shifts are found in most large groups.
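
The practical meaning of correlations such as those in Table 21 can be made
concrete with the standard error of estimate, SD x sqrt(1 - r^2), the typical
size of the error when a later IQ is predicted from an earlier one. The sketch
below is illustrative only. The standard deviation of 16 IQ points is an
assumption, not a figure given in this section; the function and variable names
are hypothetical; and the formula presumes a linear relation with equal spread
of scores at both ages. The three correlations are entries from Table 21 in
which the Stanford-Binet is the second test.

```python
import math

# Assumed standard deviation of IQ, used only for illustration.
ASSUMED_IQ_SD = 16.0


def standard_error_of_estimate(r, sd=ASSUMED_IQ_SD):
    """Typical error, in IQ points, when a later IQ is predicted from an earlier one."""
    return sd * math.sqrt(1.0 - r * r)


# (age at first test, years elapsed, correlation) taken from Table 21.
table_21_entries = [
    ("2 years", 12, 0.42),
    ("4 years", 12, 0.70),
    ("11 years", 6, 0.92),
]

for first_age, years_elapsed, r in table_21_entries:
    see = standard_error_of_estimate(r)
    print(
        f"first test at {first_age}, retest {years_elapsed} years later: "
        f"r = {r:.2f}, standard error of estimate = {see:.1f} IQ points"
    )
```

Under these assumptions a score at age 2 leaves an uncertainty of roughly 14 or
15 points about the IQ twelve years later, nearly as wide as the spread of IQs
in the whole group, while a score at age 11 narrows it to about 6 points.
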

The nature of IQ variation is seen most clearly in the records of individual
children who have been tested repeatedly. No typical pattern can be
shown, for the changes take many different forms. The three patterns in
Figure 31, selected to show some of the possible trends, are by no means
exceptional. (These records are plotted in terms of standard scores, with the mean
for children in this study taken as zero.) Case 783 is a boy whose test
performance did remain stable even though he had a poor health history, an
insecure and underprivileged home background, poor grades, and emotional
symptoms such as stammering and enuresis. "There never was a time in his
history when he was not confronted with extreme frustrations." The IQ
nonetheless held to the same satisfactory level. Case 946 has had IQs as low as 87
and as high as 142. Her parents were immigrants, and unhappily married;
their conflict led to a divorce when the girl was 7. At 9, with her mother
remarried, the girl was insecure at home and excessively modest. Her later
recovery perhaps reflects better adjustment to her family. The third case (567)
shows consistent improvement. This girl's early years were marked by grave
illnesses in the family, and the girl herself was sickly and shy. After age 10,
her social life expanded and she developed rewarding interests in music and
sports. This blossoming is paralleled in the test scores. Little is yet known
about the causes of spurts of this kind.


FIG. 31. Records made by three children on successive mental tests (Honzik
et al., 1948). Axis: age.


Scores of emotionally disturbed or uncooperative children are especially
unstable. If maladjustment is continuous, the child's test score and his
general performance may be constant, at an impaired level. But if the causes of
emotional disturbance are remedied, drastic changes in IQ occur.
Long-range planning on the basis of the IQ is justified so long as two precautions
are observed: Interpretation must consider the elements in the child's
background which would tend to raise or lower scores, and all judgments must be
made tentatively, leaving the way open for a change of plans when change
in development appears. The case of Danny (Lowell, 1941) should make
clear the hazards that await the psychologist who treats every IQ as
immutable.

Danny was born January 15, 1929. He entered kindergarten at the age of
5 years and was such a misfit that after a few weeks he was given a Binet
test. The following are records of the four tests given before the end of
Grade 6, with the date of test, chronological age, mental age, and IQ:


2-2-34     Age 5-0      MA 4-2     IQ 82
5-9-35     Age 6-4      MA 6-2     IQ 98
6-8-37     Age 8-5      MA 9-4     IQ 111
12-3-40    Age 11-11    MA 15-9    IQ 132
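
The four records can be checked against the conventional ratio definition of
the IQ, 100 x MA / CA, with both ages expressed in months. The sketch below is
only a rough check under that assumption; the function names are hypothetical,
and the ages are taken as printed, that is, rounded to whole months.

```python
def to_months(years, extra_months):
    """Convert an age written as years-months (e.g. 8-5) into months."""
    return 12 * years + extra_months


def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return 100.0 * mental_age_months / chronological_age_months


# (chronological age, mental age, reported IQ); ages are (years, months).
danny_records = [
    ((5, 0), (4, 2), 82),
    ((6, 4), (6, 2), 98),
    ((8, 5), (9, 4), 111),
    ((11, 11), (15, 9), 132),
]

for ca, ma, reported in danny_records:
    computed = ratio_iq(to_months(*ma), to_months(*ca))
    print(
        f"CA {ca[0]}-{ca[1]}  MA {ma[0]}-{ma[1]}  "
        f"computed IQ {computed:.1f}  reported IQ {reported}"
    )
```

The computed values agree with the reported IQs to within a point or so; the
small differences are what one would expect if the original computation used
exact ages in days rather than the rounded ages shown.
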


The first test showed such mental immaturity that Danny was excluded
from kindergarten for a year. The next year he moved into another school
district. This time his Binet score seemed normal; he was placed in the first
grade in September in spite of his lack of social adjustment. The teachers
complained that Danny seemed to live in a world of his own, was noticeably
poor in motor coordination, and had a worried look on his face most of the
time. The mother was called in, and only then was light thrown on his
peculiarities.

The mother explained that while Danny was still a baby his father had
developed encephalitis. In order for the mother to work, they lived in the
grandparents' home where Danny could be cared for. Danny's grandfather
was a high-strung, nervous old gentleman who was much annoyed by the
child's noise and at times expostulated so violently that Danny became
petrified with fear. The grandfather's chief aim was to keep things quiet and
peaceful at any cost. When Danny was excluded from kindergarten the
mother took him from the grandparents' home.

The next few years were a period of educational, social, and emotional
growth for the starved child. He amazed his teachers with his achievement.
He became an inveterate reader and could solve arithmetic problems far
beyond his grade level. He was under a doctor's care much of the time and
was also treated by a psychiatrist because of his marked fears. He made
friends with boys in spite of physical inferiority.


VALIDITY OF THE STANFORD-BINET


Predictive Validity


The Binet test is generally used for prediction. It is employed in estimating
the brightness of a child being considered for adoption, because prospective
foster parents wish to be sure that the child has a good chance of equaling
their own academic and business achievement. Another frequent application
is in deciding how serious a case of mental deficiency or retardation is.
Here again, both school performance and adjustment to the demands of
normal living need to be predicted.

The stability of the IQ itself gives information on predictive validity. The
adopting parent wants a child whose later IQ will be comparable to that of
others in the family. Insofar as Binet performance at age 15 is accepted as a
fair sample of intellectual performance, that test itself serves as a criterion
for tests given in early years.

Interest in Binet performance, however, rests ultimately on its relevance
to external criteria. You will recall that the Cattell-Wissler tests, for example,
dropped from sight just because their predictive validity was disappointing.
For the SB, there is rather little evidence to be cited in the form of
up-to-date formal validation. For other tests, we will most often cite specific
validity coefficients which permit us to compare quantitatively the efficiency
of various devices used for making the same decision. The Binet test
has been the patriarch of the tribe, standing without a rival until the
Wechsler test recently became available. Many validity coefficients were
calculated in the early days of the test, and these results were encouraging.
Recent predictive studies have relied almost entirely on group tests, which
were derived from the Binet scales. Since the individual test is nowadays
reserved almost entirely for individual study of perplexing cases, data are
not at hand for computing its predictive validity for representative samples.

We may not entertain serious questions about the relevance of the SB
score to practical prediction. Studies of high-school dropouts, of job
potentialities of the mentally retarded, and the like show that the test tells a great
deal about the person's expected success. Terman's follow-up of gifted
children is a particularly good long-range predictive validation against criteria
of school performance, attainment in adult professional careers, financial
success, marital success, and adult mental health.

When validity coefficients are calculated, the results are always much the
same. Here, for example, are the correlations in one high school between
SB IQ in Grade 9 and achievement tests one year later (E. A. Bond, 1940,
p. 29):

With reading comprehension    .73
With reading speed            .43
With English usage            .59
With history                  .59
With biology                  .54
With geometry                 .48
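
One hedged way to read coefficients of this size is to square them: r^2 is
the proportion of variance in the achievement measure that is linearly
associated with the IQ. The sketch below applies this to Bond's six
correlations; the names are hypothetical, and the interpretation assumes a
linear relation between the two measures.

```python
# Correlations between Grade 9 SB IQ and achievement one year later (Bond, 1940).
bond_correlations = {
    "reading comprehension": 0.73,
    "reading speed": 0.43,
    "English usage": 0.59,
    "history": 0.59,
    "biology": 0.54,
    "geometry": 0.48,
}


def shared_variance(r):
    """Proportion of criterion variance linearly associated with the predictor."""
    return r * r


# List the subjects from the highest correlation to the lowest.
for subject, r in sorted(bond_correlations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{subject:22s} r = {r:.2f}   r^2 = {shared_variance(r):.2f}")
```

Even the highest coefficient, .73, corresponds to only about half of the
variance in reading comprehension, which leaves ample room for individual
departures from the predicted level.
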


All studies show discrepancies between Binet performance and attainment,
even though the predictions are right in the majority of cases. Some of
Terman's bright boys (not very many) failed in college, or served a prison
term, or had unhappy marriages and careers. Intelligence is only one facet of
individuality to be considered in a practical decision about a child or adult.
One can neither predict behavior of a person knowing only his IQ nor make
a sound prediction without using a good estimate of his mental ability.

16. The correlation of SB IQs with grades of medical-school seniors was found to
be only .15 in one study. The average IQ of these men was 131 (Mitchell,
1943). Explain why the correlation was so small.

Construct Validity: What the Test Measures 

The most important questions to be asked about the test are: What
variables affect performance? What does the construct "general mental ability"
mean? Since most subsequent general mental tests have been made to have
high correlations with the SB, statements about the meaning of general
ability apply equally to these tests.

Looking at Table 16, we see that the test items do fit Binet's definition of
intelligence, in that they call for ability to maintain a definite set, adaptation,
and self-criticism. That the items all depend on some common element
which we can call general ability is indicated by the fact that each item
correlates with the total test. But a thoroughgoing analysis must do more than
accept items because they include an element we wish to measure. An
equally important question is: What elements affect the score that are not
considered in the definition? Logical analysis plus experimental studies have
led to several important conclusions.

• The Stanford-Binet measures present ability, not inborn capacity.
Although it seems obvious that no test can measure anything but behavior
here-and-now, there has been much confusion during the past forty years because
to many people "intelligence" means inherited ability. While there is
necessarily an inborn potentiality, the test measures only present ability, which is
affected both by innate factors and by experiences. Binet himself never
considered that his tests measured innate capacity alone. If a user wishes to
infer that a difference between SB scores of two children represents an innate
difference, he must assume that the two children have had much the same
experience during their lives. If a child has had the same opportunity as
normal children to acquire skills, information, concepts, and work attitudes
called for by the tests, his failure to come up to normal performance can
reasonably be interpreted as showing that he failed to profit from his
opportunities. The ability level can be changed by radical changes in early
environment (S. A. Kirk, 1958), although we have found no general
techniques for "mental orthopedics" (Binet's phrase) which will accelerate
significantly the mental development of normal children from normal homes.

We can list endless variations in experience that would make it easier for
one child than another to perform the Binet tasks, even if the two have equal
native ability. Freddie is only 5, but his father has played number games
with him so that he can count and add very well. Harold's mother did not
like having her walls marked, so she refused to let Harold use pencils or
crayons except under her supervision; Harold, at 7, doesn't seem to enjoy
drawing and is clumsy at it. Sarah lived in a remote rural area, where she
never saw trains or telephones. Peter's parents are immigrants; although
both parents can speak English, they find it difficult and use their native
language at home. Frances has a set of books which include interesting
puzzles, pictures containing absurdities, and pictures to compare for similarities
and differences. Such variations in experience as these are common and may
be counted on to modify both test performance and school performance.

17. List for each of these children some of the tests in Form L which they would
find easier or harder than children with "normal" experiences.

• Stanford-Binet scores are strongly weighted with verbal abilities. The
great majority of test items call for facility in using and understanding
words. If the child does poorly at these tasks, he probably will do poorly in
other verbal activities. He may do badly on the test because of poor
schooling, but this will also cause him to do badly in school in the future. The Binet
test is an excellent measure of scholastic aptitude, i.e., of readiness to do the
sort of tasks required in school. Since Binet originally sought tests which
would distinguish pupils judged superior in school performance from those
judged inferior, it is not surprising that the final test measures an ability
important in schoolwork. If one were to examine intelligent acts outside of
school, verbal facility might be found less important. The test is not a
measure of all types of mental ability; critics note that it underemphasizes insight,
foresight, originality, organization of ideas, and so on. A high score on the
test should not be interpreted as guaranteeing the qualities which the test
does not measure.

Among the pupils for whom this verbal loading produces an unfair
picture of overall intellectual performance are bilingual children, children from
homes where English is little used, children with hearing deficiencies, and
poor readers. The examiner can often identify such cases by their spread of
successes and failures, with success on nonlanguage items at levels much
beyond their first failure on verbal concepts. Table 22 compares children
who speak only English with bilinguals who speak a second language at
home. Both groups were tested on the SB and the Atkins Object-Fitting Test,
a performance test for preschool children which does not demand facility in
English. It is evident that the bilingual group, superior on the nonverbal test,
would be judged inferior on the SB.


TABLE 22. Mean IQ of Monolingual and Bilingual Children
on the Binet Test and a Performance Test

                         Mean IQ for      Mean IQ for
                         Monolinguals     Bilinguals
Test                     (N = 106)        (N = 106)        Difference (a)

Stanford-Binet           98.7             [illegible]       7.8
Atkins Object-Fitting    89.0             [illegible]      -8.5

(a) Significance tests show that neither difference could be due to chance.
Source: Darcy, 1946.

18. Which items in Table 16 depend upon previous school learning?

19. Which items would offer an advantage to a child from an upper-class home
compared to a child from an impoverished working-class home?

20. Suppose one is faced with Binet's original problem, of deciding whether a
pupil failing in school could profit from the regular curriculum. If the pupil is
bilingual, would the SB or a performance test serve better?

• The Stanford-Binet score measures somewhat different mental abilities
at different ages. This shift of emphasis is apparent in Table 16. Early tests
call for judgment, discrimination, and attention. Verbal tests and reasoning
play a much greater part in later years. If all the tests measured general
intelligence equally well, this would be no problem. But the simple mental
developments of early childhood predict only roughly the later emergence of
verbal and higher mental abilities. While the early levels of the scale are
excellent for identifying children with abnormally slow development, they
do not predict accurately the subject's later standing.

The clearest study on this comes from Maurer's work with the Minnesota
preschool tests, which are similar to the early Binet levels. Maurer followed
a group from preschool years to late adolescence and retested them to
determine what preschool test items had best predicted intellect at maturity.
At maturity she used a group test heavily weighted with verbal materials;
this test is also highly correlated with both Binet scores and school success.
She found that many items which correlated well with the rest of the
preschool scale were poor predictors of later development. Among the poor
predictors were pointing out parts of the body, obeying simple commands,
comprehension, and paper folding (Maurer, 1946).

Measures of an ability made at the time when it is first being developed
are generally poor predictors. How early a child learns to count, for example,
depends upon accidental factors as well as upon his brightness. Many
pupils who start later will overtake the one with the best early performance.
Stable measures therefore must be based on abilities that are already well
formed. Thus John E. Anderson (1944) argues that vocabulary is a
good test for older children just because it is based on a long period of
environmental stimulation. Maurer (p. 86) confirms this in her search for tests
of young children which will predict later IQ:

[Good] tests for younger children make only minimal demands on
language. They require perception of form and spatial relationships
and the ability to reproduce them. They do not demand complex motor
coordinations. They require controlled attention and ability to persist to
a goal. Many of them are comparatively independent of training. Tests
for older children [4-5 years] involve use of language in relationships
which are not often practiced and constitute problem-solving situations
involving the use of well-developed tools.

• The test requires experiences common to the U.S. urban culture and is
of dubious value for comparing cultural groups. The Zuñi Indians, for
example, have a cooperative society most unlike the competitive attitudes we
tend to encourage. Zuñi children have races. But a child who wins several
races is censured for having made others lose face. He must learn to win
some races to show he is capable, and then to hold back and give others an
opportunity to win. In arithmetic, white teachers sent Zuñi children to the
blackboard for arithmetic drills, with instructions to do a problem and turn
their backs to the board when finished. Instead, the pupils faced the
board until the slowest had finished; then all turned. This was to them
simple courtesy; following the teacher's direction would have been
exhibitionism. It is easy to see why the typical American speed test gives misleading
results among the Zuñi. A Binet test fares no better; the first subject may fail
some items deliberately, because he fears the next child will be unable to
answer. All intelligence tests face the same problem; they are adequate only
for comparing persons with similar experience. Anglo-Americans would
perhaps do badly on a test developed by a Zuñi psychologist, using questions
which differentiated between good and poor members of Zuñi culture.

• The Binet test does not give a reliable measure of separate aspects of
mentality. Scores are influenced by [illegible in source]. It
would be helpful in diagnosis to divide the test into
segments and obtain separate estimates of verbal ability, information, and so
on. Any particular specialized ability is used in only a few items; therefore
combining those items would not give a reliable measure of the ability. Binet
and Terman deliberately sought a great variety of tests, so that no one
subdivision of intelligence would have great weight in the final score. This
feature makes the SB unsuitable for measuring the aspects of ability
separately.

21. What sort of items have the greatest correlation with total test score at levels
II-6 and IV (see Table 16)? What does this suggest regarding the meaning of
"general mental ability" in preschool applications of the Stanford-Binet?

22. What items have the greatest correlation with total score at the upper end of
the scale? What is the meaning of "general ability" at that level?

• The Binet score is influenced by the subject's personality and emotional
habits. Binet's description of intelligence includes persistence, flexibility of
mental approach, and criticalness, all of which are aspects of personality.
Among the emotional habits which have an obvious effect on scores are
shyness with strange adults, lack of self-confidence, and dislike for "schoolish"
tasks. A self-critical person may say "I don't know" because he is dissatisfied
with the best answer he can formulate; a person less sensitive to niceties
may give an answer which is passable. A pedantic urge to accuracy may
make it relatively easy to do memory tasks. Fear may cause a child to
"freeze up" so that he cannot find a new mode of attack when his first one is
blocked. No matter how careful a tester is, there is some danger that a child
may fail an item that he could have passed if ability alone were required.
One should therefore always bear in mind that the final test score shows how
well the child functioned at this time; this score may be markedly affected
by emotional complications.

Hutt (1947) points out that the child encounters considerable frustration
from a succession of failures, and that this stress comes at different points for
different children. He proposes to "standardize" this stress by alternating
easy and hard items. In an experimental trial with comparable groups, he
found that very well-adjusted children earned the same IQs on his
"adaptive" procedure as on the usual test. The badly adjusted children, however,
averaged 4.5 points higher in IQ with the adaptive method.

Any departure from standard administrative practice changes the
meaning of scores. It can be readily seen that Hutt's method will yield a higher
average IQ than the Terman-Merrill procedure. Many testers who are not
willing to take the radical step Hutt proposes, which places a burden of
judgment on the tester, are nonetheless distressed by the fact that the child
encounters more and more failures, ending the test with no less than six failures
in a row. This damages the clinical relationship and influences the
subsequent tests. To avoid this outcome, and also to simplify administration, they
favor "serial administration," in which all memory-for-digits items, for
example, are presented together. Terman and Merrill arrange items by difficulty
rather than content, and their directions insist that this order be followed.
Nonetheless, a good many testers have changed over to the more
convenient serial plan, pointing to the evidence of Frandsen and others (1950)
that the mean IQ is the same for the two techniques.

23. In indicating the importance of objective mental testing, Terman says, "I
believe it is possible for the psychologist to submit, after a forty-minute
diagnostication, a more reliable and more enlightening estimate of the child's
intelligence than most teachers can offer after a year of daily contact in the
classroom." In which of the following features does the advantage of the test
over the teacher's report lie?

a. Freedom from personal prejudice.

b. Considering more aspects of mental ability.

c. Considering a basically different trait.

d. Observing capacity rather than level of actual performance.

e. Sampling behavior under a wide range of conditions.

f. Permitting an exact comparison of the child with a standard of normality.

24. How could you decide whether Hutt's “adaptive" procedure is more valid than 
the standard method? 

Diagnostic Interpretation

Children with the same MA are of course far from alike in mental
development, as is shown by the fact that they pass quite different tests. The
Stanford-Binet, as a standardized but complex situation, brings to light far more
individual differences than the single score represents. Experienced testers
always study such differences, and many have tried to develop
supplementary systems of scoring to report this information. In particular, many have
hoped that the scatter of performance would have diagnostic value. The
scatter is the range from the child's earliest failure to his highest success; it
suggests whether all aspects of ability have developed evenly. After many
studies of scatter, investigators now agree that it has no value as a score. All
other attempts to obtain diagnostic scores from the Stanford-Binet have
similarly failed.

The SB test will not yield meaningful diagnostic scores because it was
designed to prevent any factor save "general ability" from influencing scores to
a measurable degree. We cannot trace accurately the child's development
in simple recall, for example, because digit-span and other recall tests are
not uniformly spaced at all levels of difficulty. Even within one year-scale,
we cannot discuss the child's strengths and weaknesses with confidence,
because tests grouped together do not have exactly the same difficulty.

Nevertheless, the psychologist ought to study the detailed pattern of test
performance. If a child has an unusual handicap or facility in verbal tests, the
examiner has an excellent opportunity to note it. Deficiencies in information,
arithmetic skill, and reasoning may also be noted. A distinction should be
made between the child successful because of coaching, who does well on
such teachable items as counting to 13 or saying the days of the week in
order, and the more genuinely intelligent child of the same age who can
make up a coherent story about a picture and tell what day of the week
comes before Tuesday. These indications, even if brought to light only in one
or two subtests, provide profitable leads for further study. They should be
confirmed by reliable tests of the separate abilities.

The SB affords an excellent opportunity to see how the child works. An
impulsive child will be observed to use trial and error in an attempt to
"force" a solution instead of reasoning. An inhibited child may refuse to take
a chance on items where induction or imagination is called for and he
cannot be positive that his answer is right. Others give answers even to questions
about which they are ignorant.

The outcome of careful clinical study is illustrated by the tester's
comments regarding John Sanders, a normal adolescent (age 12-8, IQ 109)
(H. E. Jones et al., 1943, pp. 91-92):

John showed a lively intellectual curiosity and was interested in a variety of
things, but within each of these interests his attention seemed to be rigid and
single-tracked. This lack of flexibility made it difficult for him to adapt to
requirements when on unfamiliar ground. Upon encountering difficulties, he
frequently demanded a pencil, because he could not "see" the words or numbers; I
have never tested a more eye-minded person.

John's principal difficulties were on tests requiring precise operations, as in the
use of numbers. With such tests he became insecure and often seemed confused,
with slips of memory and errors in simple calculations. He asked to have
instructions repeated, was dependent on the examiner, and easily discouraged. Although
cooperative and anxious to do well, it was extremely hard for him to master a task
(such as "memory span") in which he was required to be exact by fixed standards.
If this is also true outside the testing situation, it is not surprising that in his school
work he has found great difficulty in learning to spell, in mastering the mechanics
of English, and in learning a foreign language. We cannot tell from this test why
he has had such unusual difficulty in this kind of learning. However, the
supposition can be offered that in tasks involving an imaginative and analytic approach
he imposes form upon himself; in tasks of the type which he finds difficult, form
is imposed upon him from without. Resistance to such controls may account in
part for the discrepancies between John's actual intelligence and his achievement
in certain fields.

Attempts have been made to identify emotionally disturbed persons by
the pattern of their subtest scores, and abnormal groups do show some
departure from normal averages. Myers and Gifford (1943), for example, find
schizophrenics superior in vocabulary, abstract words, and dissected
sentences, compared with normals of the same mental age, but much poorer on
bead chains, picture absurdities, and memory for stories. Knowledge of such
averages is useful background for the tester, but many normals show
patterns which are just like those of typical patients. The SB pattern cannot be
used to make a definite diagnosis.

Responses sometimes reveal disturbances of thinking. Feifel (1949) has
found that mental patients and normals respond to vocabulary items in
different ways. Normals tend to use synonyms, while abnormals give
definitions by use and description, explanation, and illustration. Asked what an
envelope is, normals said "a container," "a receptacle for paper," "something
to put a letter in," etc. Typical patient responses were "a piece of paper you
fold," "you write letters," "it's sticky on top so you can paste it down," and "to
mail."

Responses may disclose values and attitudes. Strauss (1941) asked
mentally defective delinquents, "What ought you to do before undertaking
something very important?" (Year X.) Their answers included "Don't
touch anything that doesn't belong to you," or "Run away from a guy who is
going to take it. Go tell him nothing of the people that owns them." In
defining pity, one of them answered, "Don't take pity on somebody, shoot them
and kill them."

Essentially the SB is a standardized clinical observation. The fact that it
yields an IQ should not blind the tester to his obligation to report everything
he can observe. There is no adequate rationale for making and interpreting
these observations, and the findings are necessarily tentative. But to avoid
them because they are subjective is no more sound than if the psychologist
were to refuse to have a conversation with the child because it would not
lead to a statistically manageable and reliable score. The Binet tester with
adequate experience has a great advantage over the clinical interviewer
because he can observe the child in a standardized situation and can compare
what he does with the behavior of other children. The fact that the child
does not realize that the test situation reveals his emotions and habits of
work is a further asset.


25. What sort of report should be placed in the school files for a child who has
been given the SB?

General Evaluation


The Stanford-Binet scale is an instrument efficiently designed for one
particular function, namely, providing a single score describing the child’s
present level of general intellectual ability. It is interesting to the child, precise,
and well standardized. The large amount of research on the scale gives a
basis for interpreting results which no newer test can offer. The 1960
revision makes an important improvement in discarding the conceptually and
statistically unsatisfactory ratio IQ; on the new scale an IQ of a given size has
the same interpretation at all childhood ages. The revision retains those
items from Forms L and M which made the greatest contribution to the total
score. The new scale places greater emphasis on word knowledge than
the older forms. Items whose content is peripheral to scholastic aptitude
were eliminated, to provide a more concentrated measure. The gain in
accuracy is offset by some loss in variety of items and in opportunity for
observation of mental processes. Although the revision eliminates the comparable
forms of the 1937 version, this is a small loss. Few testers made use of
Form M, and the Wechsler scale may be used for cases where a second
measurement is required to confirm a doubtful Binet score.

The Stanford-Binet finds the Wechsler scale for children a strong
competitor. The chief differences are in organization, in the greater precision of the
Binet at low mental ages, and in the greater variety of tasks in the Wechsler
scale. The difference in content of the two scales is magnified by the L-M
revision, which narrows and focuses the SB. Deliberate concentration on
verbal and educational abilities is an advantage for some purposes, a
disadvantage for others, as is made apparent in E. L. Thorndike’s comment
(1921):

If the boy has had ordinary American opportunities, this score [in
standardized tests of the Binet or of the group test type] will prophesy
rather accurately how well he will respond to intellectual demands in
cases of “book-learning” at the time and for some time thereafter, and
very possibly for all his life. It will prophesy less accurately how well he
will respond in thinking about a machine that he tends, crops that
he grows, merchandise that he sells, and other concrete realities that he
encounters in the laboratory, field, shop, and office. It may prophesy
still less accurately how well he will succeed in thinking about people
and their passions and in responding to these.

Such objections as this have led clinicians to combine the SB with
performance scales. The SB does not measure all aspects of mental ability, nor does
it measure inborn capacity. It is properly interpreted as a measure of present
status in one important type of mental development.

26. The Stanford-Binet has been criticized because it contains numerous items
relating to death and other morbid subjects. What has this to do with the value
of an intelligence test, so long as brighter pupils pass these items?

27. Terman, in revising the original Binet scale, discarded items which showed a
consistent difference in favor of either sex. His argument was that a fair
comparison could not be made if items favored one sex or the other. Did the
elimination of such items raise or lower the validity of his scale?

PERFORMANCE SCALES 

Individual tests owe their prominent place in testing of problem children,
psychiatric patients, and the mentally retarded chiefly to their value as a
situation for observing performance. Any individually administered test
offers some opportunity to observe the nature of the subject’s errors, habits of
performance, and emotional reactions. Performance tests such as Block
Design, however, permit especially revealing clinical observations. Moreover,
they depend very little on language and schooling, which makes them
suitable for evaluating young children, adults with limited schooling, and
persons unfamiliar with the language of the tester.

The Stanford-Binet includes some performance tests: drawing,
bead-stringing, fitting blocks into holes, etc. These tests are relatively few in
number, are concentrated at the easier levels, and in general do not require
complex reasoning. Terman equated intelligence to “the power of abstract
thought,” and therefore most of his items involved verbal or numerical
concepts. While verbal items do have considerable predictive power, especially
for educational criteria, clinical testers need more elaborate performance
scales than the Binet offers.

Performance scales give somewhat different information from that yielded
by the Stanford-Binet. Figure 32 shows a comparison of ten children on
the Stanford-Binet and on a composite of three performance tests. There are
several shifts in rank order, the most striking change being Margaret’s.
She is highly superior in schoolwork, but she has a heavy build, acts slowly,
and is socially awkward with adults. Her difficulty in performance tests may
therefore reflect personality problems that limit her effectiveness in many
life situations.

  Rank order,                  Rank order, average
  Binet IQ                     performance test score

  161  Margaret                Douglas
  143  Douglas                 Amy
  143  Amy                     Christopher
  134  Carol                   Virginia
  132  Christopher             Dick
  129  Virginia                Walter
  120  Allison                 Carol
  110  Mark                    Allison
  109  Dick                    Mark
  104  Walter                  Margaret

FIG. 32. Rank of ten superior 7-year-olds on the Binet test and
on a battery of performance tests (Biber et al., 1952).

A sample of the information the performance test yields to a skilled
observer is indicated by the following record. Mark, 8 years old, had a Binet
MA of 8-8. On some performance tests, however, he reached an MA of 9-6
(Biber et al., 1952).

The most striking feature of Mark’s examination was his extreme lack of
confidence and his desire to do what was expected of him. This was manifested by his
constant reference to the examiner. Throughout all of the tests, although he said
little, it was evident that he was referring back to see whether the expression on
the examiner’s face indicated approval.

In the Healy Completion [fitting small blocks into holes to complete a picture]
the examiner noticed that once when she gave him a friendly smile he was content
to leave an inferior solution, as if he were guided much more by his wish to please
than by his own good intelligence. Although she busied herself with papers and
tried to pay as little attention as is compatible with a test situation, it was
impossible to prevent this. The directions in the Healy Completion to look the work over
carefully and see if there are any changes to make seemed to imply criticism to
Mark, and he removed a block which was correctly placed and substituted a
blank. His first responses were all good. In this test, he placed the first three
accurately, then, apparently, he began feeling anxious or uncertain, and the last
three he placed were blanks. It seemed that he was using the blanks as a way of
avoiding committing himself to a mistake, and that he felt that he would rather do
nothing than to get the wrong result. This test was the most plainly motivated by
his desire for approval, although there were indications of it throughout the other
tests as well.

In the Pintner-Paterson series, he seemed to be less conscious of the examiner,
probably because he felt more sure of himself in those tests. When he was
uncertain, as in the Ship and Triangle Tests [formboards], he would look up shyly
as he worked. Several times he commented, “That’s easy.”

The first part of the [Porteus] maze series he enjoyed, working quickly,
accurately, and with ease. After his first failure in Year VIII, he seemed much more
uncertain and slow. After practically every one he said, “I’m not going to do any more
of these.” With constant encouragement, he went on and completed Years X, XI,
and XII, although he had four trials on Year XII. Toward the end of the series,
there was little evidence of real effort on his part, but rather he seemed to be going
through the motions because the examiner urged him on.

Probably no test results on Mark are completely accurate because other factors
besides ability are so definitely involved in his behavior. Difficulty did not
stimulate him, as it did Douglas and Amy, for instance, but simply discouraged him and
left him tense and uneasy. He was responsive to praise, but always with a
questioning expression, as if he were trying to ferret out what one really thought of him.
It was consistent with his total defensive attitude that he offered very little
information during the test and several times he responded very shortly to questions
that the examiner put to him.

28. What light do these observations throw on the interpretation of Mark’s
Binet IQ?

29. List several of the characteristics a tester should attempt to observe in giving
an individual performance test.

30. In many clinical examinations, only part of a battery of performance tests is
used, to save time. On what basis would you decide which subtests to retain?

THE WECHSLER SCALES 

The most important performance tests today are those included in
Wechsler’s intelligence scales. His effort at test development began at the
Bellevue Hospital operated by the city of New York, where social derelicts of
many sorts had to be tested. These persons might be feeble-minded,
psychotic, or illiterate; estimation of the intellectual level of each case was
important in determining his disposition. Wechsler prepared the
Wechsler-Bellevue Scale Form I in 1939 to provide for such clinical evaluations. This
scale was of great value in military hospitals during World War II and
became one of the chief tools of the clinical psychologist after the war. A
second form was published in 1946 but was never adequately standardized.
The “Wechsler-Bellevue Scale” is now obsolete, having been replaced by
better-constructed and better-standardized forms. Today’s Wechsler series
consists of WAIS, the Wechsler Adult Intelligence Scale (1955) for ages 16
and above, and WISC, the Wechsler Intelligence Scale for Children (1949)
for ages 5-15.

Strange ironies attend the history of test development. Binet set out to
identify mental defectives, yet the most famous piece of research with his
scale was concerned with children of superior endowment. Wechsler tried
to prepare a new type of mental test for adults, because adults and children
differ in their interests and approach to work. Yet today his technique is
popular as a children’s test. His secondary hope in developing the test was
that patterns of subtest scores would provide a ready means of clinical
diagnosis. The hope was not realized, and this type of analysis is no longer
depended upon because empirical checks show that pattern analysis has little
validity. Wechsler’s series is now of chief importance as a general individual
test for all ages.


Test Materials and Procedure 

Wechsler collected a variety of items, many of them from previously
published tests. He subscribed to Binet’s idea of a general mental ability, but his
experience suggested that in the mental patient some types of performance
or reasoning are more disturbed than others. Wechsler gave preference to
items which he had found useful in understanding the intellectual
functioning of patients. Wechsler sought items which, while falling within the area
we identify as general mental ability, had sufficiently specific characteristics
to silhouette different types of thinking or performance.

In contrast to the Terman-Binet plan of grouping items according to
difficulty, mixing content randomly, Wechsler arranges them into subtests of
various types. There are eleven scored subtests, grouped in a Verbal and a
Performance scale. The Verbal scale includes tests of Information,
Comprehension, Digit Span, Similarities, Arithmetic, and Vocabulary. The
Performance scale includes Picture Arrangement, Picture Completion, Block
Design, Object Assembly, and Digit Symbol tests.

We shall describe the Verbal scales only briefly and then turn to the
Performance tests. Items are taken from the WAIS form (Wechsler, 1955). 2

Information includes such items as “What is the population of the United
States?” “What does rubber come from?” and “How many weeks are there
in a year?” Comprehension questions include “Why should we keep away
from bad company?” and “What does this saying mean? ‘Shallow brooks are
noisy.’” The subject is expected to give a generalized, fairly direct answer.
In Digit Span, the subject is asked to repeat digits forward and backward.
The Similarities scale asks the subject to tell how the following are alike:
orange and banana, air and water, poem and statue, etc. Arithmetic is a test
of numerical reasoning ability using simple verbal problems, such as “How
many oranges can you buy for 36 cents if one orange costs four cents?” The
subject is required to do the items mentally and receives no credit on an item
where he uses more than a reasonable time (e.g., thirty seconds for the
question about oranges). Vocabulary requires the subject to define or explain
such words as “fabric,” “conceal,” and “tirade.”

The Block Design test was described in Chapter 3. Materials used in
several other Performance scales are illustrated in Figure 33. Whereas Block
Design requires analysis of a complex whole, breaking a pattern into
elements, Object Assembly gives the parts and requires the person to discover
how they go together. The four tasks are the profile, manikin, hand, and
elephant. Time bonuses for rapid performance are allowed.

FIG. 33. Materials from Wechsler Performance tests. (Copyright ©, 1955, The Psychological
Corporation. Reproduced by permission.)

2 Items quoted in this section copyright © 1955, The Psychological Corporation.
Reproduced by permission.

Digit Symbol requires the person to fill in the proper code symbol under
each number, doing as much as he can in a short time. Our illustration
shows only five symbols; the actual test uses ten. The code remains in front
of the subject as he works. Thus he can continually refer to the code, or he
may carry it in his head. Learning the code is easy enough that for
above-average adults the score becomes a measure of writing speed rather than
mental ability.

There are two picture tests. Picture Completion uses items presented on
cards, each showing a picture from which something is missing. The subject
tells what is lacking. In Picture Arrangement, a story is told in three or more
cartoon panels which are presented in random order; the subject must piece
them together in the correct order. Here again, the subject must identify a
complex whole from disorganized parts.

The WISC series is a downward extension using easier items than WAIS.
The same subtests are used, but Digit Span is an optional test for children
because it has a low correlation with overall performance. A Maze test is
added as an optional performance test. Coding a simple message, a task
used by Terman, is substituted for the more difficult Digit Symbol task.

The Wechsler scales are comparatively simple to administer, the full
WAIS requiring about one hour. The directions are less complex than those
for the Binet, and keeping similar items together reduces the task of the
examiner. The skill of the examiner may influence the score greatly. In some
of the verbal tests, the examiner must make rather sensitive judgments as to
the correctness of an answer since it may be necessary to request the
subject to elaborate his meaning. Answers that seem wrong may be correct
when the subject explains himself. Subjectivity in scoring borderline answers
is also a potential problem.

31. Does the Digit Symbol test call for the same mental processes when three
digit-symbol pairs are used as when ten are used? Does Wechsler’s Digit Symbol
test call for the same processes from bright and dull subjects?

32. Which Wechsler subtests have the following characteristics?
a. The score is affected by educational background.
b. The test demands experiences found in the urban American culture.
c. The test requires problem solving or reorganization of knowledge rather
than mere recall.
d. The test measures very simple mental processes such as Cattell and Wissler
investigated.

33. How do the Wechsler test items differ from the higher levels of the
Stanford-Binet?

Meaning of Wechsler IQs 

The raw scores on the subtests are converted into scaled scores, i.e.,
normalized standard scores with a mean of 10 and s.d. of 3. This conversion for
WAIS is based on a reference group of adolescents and adults of each age,
carefully chosen to match the census distribution on sex, geographical region,
urban-rural residence, race, occupation, and education. A similar sample
(restricted to white children) was used for the WISC.

Wechsler introduced standard-score IQs in his first edition, anticipating a
practice which Terman and Merrill later accepted. He chose to fix the
mean at 100 and the standard deviation at 15. The discrepancy between
Wechsler standard scores and those developed with the Terman-Merrill
s.d. of 16 is unfortunate, but should rarely be a source of serious confusion.
Wechsler eliminated completely the mental-age conversion, which is a
source of some misunderstanding in Binet interpretation.
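
The two conversions just described can be made concrete with a short sketch.
It assumes that a subtest raw score has already been located as a percentile rank
in the reference group, and it uses the normal curve to obtain a normalized
standard score. The function names are illustrative only and do not come from
the Wechsler manuals, whose published conversion tables govern in practice.

```python
from statistics import NormalDist


def normalized_z(percentile_rank: float) -> float:
    """Return the normal-curve z score for a percentile rank in (0, 100)."""
    if not 0.0 < percentile_rank < 100.0:
        raise ValueError("percentile rank must lie strictly between 0 and 100")
    return NormalDist().inv_cdf(percentile_rank / 100.0)


def scaled_score(percentile_rank: float) -> float:
    """Subtest scaled score: normalized standard score, mean 10, s.d. 3."""
    return 10.0 + 3.0 * normalized_z(percentile_rank)


def deviation_iq(z: float, sd: float = 15.0) -> float:
    """Standard-score IQ with mean 100; s.d. 15 (Wechsler) or 16 (Terman-Merrill)."""
    return 100.0 + sd * z


if __name__ == "__main__":
    # A person two standard deviations above the mean of his age group.
    print(deviation_iq(2.0, sd=15.0))   # Wechsler convention
    print(deviation_iq(2.0, sd=16.0))   # Terman-Merrill convention
    # A subtest raw score at the median of the reference group.
    print(round(scaled_score(50.0), 2))
```

At two standard deviations above the mean the two conventions give 130 and 132,
a difference of only 2 points, which is why the discrepancy is rarely serious.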

Wechsler criticized sharply the original Stanford-Binet assumption that
mental ability remains constant during adulthood. Mean scores on almost
any mental test rise during early adulthood and decline later. Wechsler
therefore developed separate standard-score conversions for adult age
groups. In the Stanford-Binet system, where adult norms have not been
developed, a given raw score yields the same IQ at all adult ages. In
Wechsler’s conversion tables, a raw score of 129 yields an IQ of 115 at age 16,
111 at age 20, 111 at age 40, 121 at age 60, and 136 at age 80. Wechsler’s
other major innovation in scoring was to provide separate standard-score
conversions for the Verbal and Performance scales.

Wechsler and SB IQs are not interchangeable. When Bayley (1949) gave
the 1937 SB and Wechsler-Bellevue tests to the same group of adolescents,
the mean SB IQ was 132 and the mean Wechsler IQ only 122. The
Wechsler s.d.’s were also lower. This is confirmed by the fact that in Wechsler’s
standardization for WISC only half as many children had IQs 130 and over
as in the Terman-Merrill standardization. Even clearer evidence (Table 23)
comes from the New York City portion of the WISC standardization data,
where 332 children drawn from eighteen schools were tested with both
WISC and SB. The SB IQs ran substantially higher and their s.d. was
greater (J. Krugman et al., 1951). Since both the WISC and SB scales were
standardized on carefully selected samples, it is hard to decide which set of
norms is wrong. The best we can do without much more evidence is to
recognize that SB IQs average some 7 points higher than Wechsler IQs during
childhood and early adulthood.

TABLE 23. WISC and SB Results for Representative Children in New
York City

                                               Correlation
                             Mean     s.d.     with SB

  Stanford-Binet Form L      108.4    15.8       --
  WISC Full scale            101.2    12.8      .82
       Verbal scale          103.4    13.6      .74
       Performance scale      98.3    15.0      .64

  Source: J. Krugman et al., 1951.

Wechsler’s WAIS standardization data are consistent with his belief that
mental ability reaches its peak in early adulthood. In Figure 34, we see that
the average Performance score has its peak about age 22 and then drops very
rapidly. The Verbal average rises until about age 30 and falls off more
slowly. The Total score, not plotted, reaches its peak in the late 20’s.

FIG. 34. Changes in mental-test score with age, based on cross-sectional
samples for the WAIS (Wechsler, 1955). An arbitrary common
scale has been used for plotting the two scores, since raw scores on
one scale cannot be compared with raw scores of the other.

Although many other studies give similar curves, this result is no longer
accepted as a true picture of the course of intellectual growth and decline. All
the studies showing a drop in early adulthood are cross-sectional, i.e., the
average for each age is based on a different group of persons. This means
that persons at the two ends of the chart belong to different generations and
developed their ability under quite different social circumstances. Anastasi
(1956) points out that Wechsler’s older groups have less education than his
younger samples, which may account for much of their poorer performance.
Bayley (1955) combined evidence from three longitudinal studies in which
the same persons were tested on occasions as much as thirty years apart, and
concludes that the test scores continue to rise at least until the age of 50. If
her curve based on limited samples (and depending on a vocabulary test for
many of its points) shows the true pattern of growth in mental ability,
Wechsler’s norms will soon be outdated (see also Bayley, 1957; Bradway
et al., 1958). He proved that, in 1950, adults born in 1910 (age 40)
performed worse than adults born in 1920 (age 30). But if Bayley is correct, this
latter group will continue to grow, and in 1960 their scores will average much
better than Wechsler’s 40-year-old sample tested in 1950. Bayley’s data
suggest that cultural changes are year by year raising the mental ability of the
nation.

34. A maze test is available in WISC but not in WAIS. What data would one
obtain to decide whether the maze test should be made a part of the WAIS
scale?

35. The age curve for the WAIS, based on data gathered in the 1950’s, has its
peak at a later point than the curve based on the Wechsler-Bellevue
standardization in the late thirties. What does this fact imply?

36. Would a vocabulary score be more or less likely than a performance score to 
improve between ages 20 and 40? 

Adequacy as a Measure of General Ability 

The Wechsler test, taken as a whole, measures about the same ability as
the Stanford-Binet. The correlation of .82 between the two tests reported in
Table 23 is fairly representative of the concurrent validity of the Wechsler
tests.

The correlations show that the Verbal scale is much more closely related to
the SB than is the Performance scale, and in some studies the Verbal scale
gives a significantly higher correlation with the SB than does the Full scale.
There is, then, a real psychological difference between the Stanford-Binet
and the broader Wechsler. In any composite score, however, elements
present in only part of the test have far less influence on the total than do
elements running all through the test. Abilities to comprehend directions, to
concentrate, to criticize and correct one’s responses, and to understand
words and pictures referring to familiar experiences run through both the
Verbal and the Performance scales. These general abilities therefore largely
determine the total score on both the Wechsler and Binet tests; specific
abilities found only in arithmetic items or performance items have some
influence, but not very much.

The reliability of the Full scale of the WAIS is reported by Wechsler as .97
(Wechsler, 1958). This split-half coefficient based on persons of uniform age
is spectacularly high; it is reasonable to expect a somewhat lower correlation
between tests at two sittings. For the WISC, split-half coefficients are also
above .90. There is as yet little evidence on the stability of Wechsler scores;
we can expect the total score to be as stable as that from the Stanford-Binet,
with the Verbal score more stable than the Performance score (Bayley,
1957). The subtests will probably show different degrees of stability, and
such evidence may have important theoretical and practical implications.

37. Clinical psychologists frequently distinguish between a patient’s “intellectual
equipment” and his “functioning ability.” Equipment is thought of, not as
inborn capacity, but as the maximum intellectual power the person could
summon up at this time. The equipment often does not function at its best,
however, because of impulsiveness, inhibition due to anxiety, autistic thinking, and
other limitations. This point of view argues that people fall below their true
potential to varying degrees.

a. In terms of these concepts, what does the Wechsler test reveal?

b. In terms of these concepts, what does the Binet test reveal?

c. Which of these concepts comes closest to “intelligence”?

The Verbal and Performance Scores 

The separate IQs for Verbal and Performance tests measure different
abilities. This is shown by their correlations with the Stanford-Binet, reported
above, and by their correlation with each other. In various age groups Verbal
IQ and Performance IQ correlate only .77 to .81, though their split-half
reliabilities are .93 or better.

Most performance tests, taken singly, are extremely unreliable, and even
scales combining several tests may be undependable. The high reliability of
Wechsler’s Performance scale shows that he has done unusually fine test
construction. Each performance item requires longer than a verbal item, and
it is therefore difficult to obtain as good a sample of ability in limited time.
Emotional blocking, carelessness, and undue haste often cause a person to
fail a performance item which would otherwise be easy for him. Wechsler
has overcome these adverse influences by writing clear directions, using a
variety of tasks with several items of each type, and developing precise
scoring standards. As a result, his Performance scale is probably the most
dependable nonverbal measure ever developed.

Even though the two scales are highly reliable, the difference between
them is less reliable (see p. 287). The Wechsler Verbal and Performance
scores are quite accurate, and the difference between them has an estimated
reliability of .74. This is high enough to justify drawing conclusions about the
person whose Verbal and Performance IQs differ by 15 points or so. Small
differences, however, cannot be taken seriously.
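
A figure close to .74 can be reproduced from the usual formula for the
reliability of a difference between two scores having equal standard deviations:
the mean of the two reliabilities, minus the correlation between the scores,
divided by one minus that correlation. The sketch below assumes, for illustration
only, reliabilities of .96 and .93 for the Verbal and Performance IQs and a
correlation of .79 between them. These inputs are assumptions chosen to be
consistent with the ranges quoted in this section; they are not values stated
in the text, and the function name is hypothetical.

```python
def difference_reliability(r_vv: float, r_pp: float, r_vp: float) -> float:
    """Reliability of the difference between two scores with equal s.d.'s.

    r_vv, r_pp: reliabilities of the two scores (e.g., Verbal and Performance IQ)
    r_vp:       correlation between the two scores
    """
    if not -1.0 < r_vp < 1.0:
        raise ValueError("the intercorrelation must lie strictly between -1 and 1")
    return ((r_vv + r_pp) / 2.0 - r_vp) / (1.0 - r_vp)


if __name__ == "__main__":
    # Assumed inputs: Verbal .96, Performance .93, intercorrelation .79.
    print(round(difference_reliability(0.96, 0.93, 0.79), 2))
```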

When a person does much better on performance tests than on verbal tests,
we suspect him of having a language handicap. One who learned English
late in life, who has had a very limited education, or who suffers from
deafness will perform badly on tests of word knowledge and verbal reasoning,
both because of difficulty in understanding the verbal test and because of
limited ability to reason verbally. Since performance tasks depend very little
on schooling and the directions use simple language, verbal handicaps
reduce the score only slightly. Many adults who might be regarded as
defective if judged only by their verbal comprehension are able to perform
nonverbal tasks at an average level.

Behind this interpretation is the assumption that people who have
developed normal performance ability would do equally well in verbal tasks if
they had had normal experience. When one can identify restrictive factors
in the person’s past history, this interpretation can scarcely be denied. Poor
verbal ability is easily understood in the case of the child from a
bilingual home, the child who has had difficulty in learning to read, and the
adult who dropped out of school at an early age. Many others, however,
show Verbal-Performance differences where no handicap can be identified.
The psychologist is unable to say whether such differences are due to
unidentified background factors or to some innate lack of specialized verbal
aptitudes.

While verbal handicaps are easily identified, there is rarely such an
obvious handicap to explain the cases where Performance IQ is well below
Verbal. Some poor performances are accounted for by emotional blocking. A
performance test demands a longer period of steady work and sometimes a
series of trial-and-error operations; the person who becomes upset will
perform erratically. No comparable blocking occurs in short-answer verbal
questions where the person’s failures are less obvious to him, although it is
sometimes observed in the Arithmetic subtest. A painstaking, cautious
performance will lower the Performance score. Such undue caution is
interpreted as having emotional origins.

Sometimes the verbal score is elevated by an artificially cultivated
vocabulary. Some parents encourage children to build large vocabularies, and
some students and adults make a great effort to learn new words. The tester
sometimes observes in such subjects a love of big words and an effort to give
impressively complicated answers to simple questions. The person who has
a one-sided verbal development often does better on recall questions
(Information, Vocabulary) than on items demanding independent thought
(Comprehension, Similarities).

Since there is no single interpretation for any pattern of
Verbal-Performance differences, such a difference is merely a signal to the tester that further
data on the case are needed. A study of the test performance as observed,
an inquiry into the person’s background, and usually supplementary tests are
required to arrive at a deeper understanding of the difference.

38. Digit Symbol is an exception to the finding that delinquents do better on Per¬ 
formance tests than nondelinquents. There is, in fact, a significant difference 
in the opposite direction How can this be explained? 

39. Wechsler deliberately included subtests which are susceptible to emotional in¬ 
fluences in his measure of "intelligence ” In your opinion, does this increase or 
decrease the usefulness of the test? 

40. Bill and John, two 15-year-olds, are referred to the school psychologist
because both are failing in ninth-grade work, their courses being social
studies, English, general science, and art appreciation. Both have IQs of 93,
but Bill has a Verbal IQ of 95 and a Performance IQ of 92, while John has a
Verbal IQ of 87 and a Performance IQ of 106. How would the interpretations and
suggestions for dealing with the two boys differ?

41. A relatively low Performance IQ suggests emotional disturbance. When
Verbal IQ is lower than Performance IQ, can we conclude that the person is
well adjusted? (Consider Mark, p. 191, in this connection.)

Interpretation of Subtests and Profiles 

The Wechsler scale is neatly organized into subtests, and many attempts
have been made to develop separate interpretations of the several subtests.
The meaning of subtest scores is not reducible to such simple translations as
“Low Vocabulary means poor ability to deal with symbols.” Vocabulary is
affected by the subject’s acquaintance with our verbal culture, including
schooling, and his ability to express himself, which may be impeded by
emotion. It differs from Similarities in that Vocabulary can be passed on the
basis of recall, whereas Similarities requires reorganizing information. The
elements that influence each particular subtest can be known only through
great experience with the test and study of the research literature.

Wechsler finds some association between patterns of subtest scores and
particular types of mental disorder, and has recommended the test for
clinical diagnosis. Many clinicians have developed special formulas of their
own for combining subtest scores into indices supposedly characteristic of
brain-damaged patients, schizophrenics, etc. As an illustration of these
clinical hypotheses we may quote Schafer’s description of the pattern found in
psychopathic character disorder (1948, p. 54):

The characteristic pattern is a superiority of the Performance level over the
Verbal, low scores on Comprehension and Similarities and high scores on the
tests of visual-motor coordination and speed [Object Assembly, BD, Digit
Symbol]. Often the Digit Span score does not drop, reflecting the
characteristic blandness. Frequently Picture Arrangement is conspicuously
high. This is especially true for shrewd “schemers.” If Picture Completion is
high, over-alertness or watchfulness is probably characteristic.

Qualitatively the chief feature is usually blazing recklessness in guessing at
answers . . . “George Bernard Shaw wrote Faust,” “Magellan discovered the
North Pole,” “Chattel means a place to live (chateau),” “Ballast is a dance
(ballet),” “Proselyte means prostitute,” and so forth. . . . The over-all
pattern will indicate that this is a bland, unreflective, action-oriented
person whose judgment is poor, whose conceptual development is weak, but
whose grasp of social situations may yet be quick and accurate.

Some of the proposals for diagnostic interpretation have been little more
than plausible guesses or generalizations from small, unrepresentative
samples. Even the suggestions based on sound research have limited practical
value, primarily because they rest on unreliable difference scores. Only
unusually large differences between subtests (greater than 3 scaled-score
units) should be taken seriously (Wechsler, 1958, p. 164).
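
The unreliability of difference scores can be illustrated with the standard
psychometric formula for the reliability of a difference between two scores
expressed on the same scale with equal variances. The sketch below is only an
illustration: the function name and the sample reliabilities and
intercorrelation are assumed values, not figures taken from Wechsler.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y for two equally variable scores.

    r_xx, r_yy: reliabilities of the two subtests.
    r_xy: correlation between the two subtests.
    """
    if r_xy >= 1.0:
        raise ValueError("r_xy must be below 1.0")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical subtests: each fairly reliable, and substantially correlated.
example = difference_reliability(0.80, 0.80, 0.60)
print(round(example, 2))
```

With these assumed values the difference has a reliability of only about .50,
well below that of either subtest. The more two subtests correlate, the less
dependable their difference becomes.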

There is theoretical justification for expecting brain damage to impede
one type of performance more than another, or for expecting psychopaths to
suffer where pretentious, incautious responses are penalized. The effect of
personality is masked, however, by the influence of general mental ability,
other aptitudes and experience factors, attitudes in taking the test, and
random errors. Many studies agree that on the average schizophrenics have
Verbal IQs higher than Performance. But when we look at Rapaport’s data
(1945, Appendix II), we find that only 31 out of his 72 schizophrenics have
Verbal IQs five points or more above the Performance IQ. Even among the
highway patrolmen used as a comparison (“normal”) group, 18 out of 54
showed this “sign.”
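
The counts just quoted can be turned into rates to show how weak the "sign"
is. The following sketch uses only the four numbers given in the text; the
function and variable names are illustrative, and the third figure describes
this particular pair of samples, not any clinic population.

```python
def sign_rates(hits: int, n_patients: int, false_alarms: int, n_normals: int):
    """Return (hit rate, false-positive rate, share of sign-showers who are patients)."""
    hit_rate = hits / n_patients
    false_positive_rate = false_alarms / n_normals
    share_patients = hits / (hits + false_alarms)
    return hit_rate, false_positive_rate, share_patients


# 31 of 72 schizophrenics and 18 of 54 patrolmen showed the sign.
hit, fpr, share = sign_rates(31, 72, 18, 54)
print(f"{hit:.2f} {fpr:.2f} {share:.2f}")  # 0.43 0.33 0.63
```

The sign is present in fewer than half of the patients and in a third of the
comparison group, so it separates the two groups only slightly.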

Basing diagnosis on multiple signs reduces errors of classification, but
rarely does a patient show all the signs of his class. At best, one can hope to
find statistical trends which distinguish groups of psychopaths (for example)
from other groups. No objective treatment of the Wechsler scores has proved
able to classify individual patients with a useful degree of accuracy. Indices
representing “scatter” of subtest scores (e.g., the range from highest to
lowest subtest score) are worthless as diagnostic signs (Patterson, 1953,
pp. 41-76).

It will be noted that Schafer did not propose to identify psychopaths by a
numerical treatment of subtest scores. He examined the nature of the errors
and successes to arrive at a qualitative picture of the personality. The
Wechsler scale is in some ways superior to other tests or interview procedures
as an aid in forming such impressions, because the questions are the same
for all subjects, are varied, and elicit highly revealing responses. If the
clinician wishes to describe the subject, he should consider the Wechsler
subtests individually and qualitatively (with due awareness that he may be
interpreting random variation). The clinician must not regard an impression
formed in this manner as a diagnosis. The impression is useful, but it is not a
scientific conclusion. The Wechsler yields a general measure of mental
ability and a verbal-performance difference, and beyond that can offer hints
leading to further study of the individual.

Before going on to a more general discussion of performance tests, we can
summarize briefly the virtues and defects of the Wechsler scale. It is
efficiently designed, interesting to most subjects, and at least as valid for
predictive purposes as the Stanford-Binet. It covers a broader range of tasks
and affords exceptionally good opportunities for qualitative observation of
behavior and thought processes. The norms for the test, once a point of
serious criticism, have been greatly improved. As a practical individual test,
the Wechsler falls short in only one particular: the scale has insufficient
range to measure very high and very low abilities dependably.

The test is, however, a distillation of clinical experience, and this
contributes both to its strength and to its weakness. It is a useful sample of
complex behavior in which emotional and intellectual factors are entwined. But
it is based on no clear theory of intelligence and makes no serious effort to
separate mental ability from other aspects of adaptation. The tasks are chosen
from techniques invented thirty years or more ago, and there is no adequate
rationale for interpreting the subtest scores. It is reasonable to hope that
some future worker will start from a theory of mental processes, choose or
design tests to measure those particular processes, and so arrive at a
superior diagnostic device. The total score on such a test would almost
certainly correlate substantially with Wechsler’s.

42. Many clinicians have tried to select an abbreviated test from the Wechsler
series so as to obtain a quick measure of ability, though one of inferior
reliability (McNemar, 1950). What would you consider in deciding which three
subtests to use? Which three subtests seem best to you for this purpose?

43. Is a high or a low correlation between subtests desirable in a general mental 
test? 

44. What description of the patient’s thought processes is suggested by each of
these responses to "Why should we keep away from bad company?" (Schafer,
1948)

a. Your friends will talk about you, if we want to live in a good environment
we must choose good company. (IQ = 107)

b. I don't know if that necessarily holds true. To prevent picking up their bad
habits, I guess. (IQ = 123)

c. It’s a trend toward living the same kind of a life, get bad yourself.
(IQ = 127)

45. Match the responses in the preceding question to these answers to "Why do
we have laws?" given by the same set of patients.

a. Govern the behavior of people. (E queries.) There has to be some
maintenance of order by which government policies are carried out as well as
personal behavior of individuals.

b. To have a law-abiding group of people, otherwise they would corrupt the
city.

c. To make good citizens out of us, to keep the unruly under control.
46. Harper (1950), comparing 245 schizophrenics to 237 normals, established
reliable differences between subtests. A formula for combining standard scores
on subtests is offered: .28 Inf - .15 Comp + .17 DSp - .19 Pic Com +
.25 BD - .35 DSym (+ other small terms). A "cutting score" halfway between
the mean for normals and the mean for schizophrenics was used. In a new
sample, 68 percent of schizophrenics fell beyond the cutting score. The formula
is thus shown to be truly discriminating. In view of the large number of
misclassifications, what value does the formula have in practice?

47. According to Harper's formula, schizophrenic profiles tend to have a high
point on Block Design and a low point on Digit Symbol.

a. Can you explain this?

b. Could you give an equally convincing explanation if the opposite had been
found?

48. What advantage would there be in using an "intelligence" test to diagnose
abnormal personalities, over using a "personality" test having similar validity
for that purpose? Would this argument hold if the "personality" test had
definitely higher validity for diagnosing such cases?


WHAT PERFORMANCE TESTS MEASURE 

There is no need to describe performance tests other than the Wechsler in
detail. Some, like the Arthur scale, are collections of tests covering a
variety of performances. Others, like the original Kohs Block Design Test
(see p. 41) or the Porteus mazes (p. 29), are devoted to a single type of item.

Some performance tests are better measures of general ability than others,
either because they are more reliable or because they make a greater
intellectual demand. Simple timed formboards demand manipulative speed
more than thought and have rather low correlation with general ability. In
the WAIS the tests which correlate highest with the Performance IQ are
Block Design and Picture Completion. Digit Symbol actually correlates
higher with the Verbal IQ than with the Performance IQ (Wechsler, 1958,
p. 255).

Cultural Influences 

Frequently it has been claimed that performance tests are “culture free.”
A “culture-free” test is one on which scores are completely uninfluenced by
experience in a particular environment. Such a test would give a fair
comparison of mental abilities in different countries and across different
social classes.

Educational handicaps show up directly in a verbal test. This is illustrated
by tests on English “canal-boat children,” who live a nomadic life and
have an impoverished, unintellectual environment. Binet tests correlated
.58 with educational level, but a performance test correlated only .26. The
performance IQs were about 10 points higher (Gaw, 1925).

The skills involved in performance tests are developed through learning,
and every culture provides some amount of training along the lines tested.
Egyptian psychologists examined mental development in a primitive tribe
living on the edge of the desert (Fahmy, 1954). They found that scores on
most performance tests were quite a bit below the European average for
children of the same age. On a test, however, which called for assembly of
colored mosaics (similar to Block Design) these children averaged slightly
above the European norms. Color plays a large part in the ceremonies of
this culture and in the children's games. This evidently helps their test
performance by giving them experience in examining patterns or by developing
their interest in such tasks. (See also Havighurst et al., 1946.)

Training in particular types of discrimination and reasoning probably
influences only a few performance measures. The subtle effects on attitude and
motivation are likely to affect all tests. The educated classes in America, in
Europe, and in nations influenced by Western civilization are taught from
early childhood to take intellectual matters seriously. The child is rewarded
for answering adults’ seemingly pointless questions. He shares puzzles and
word games with his playmates, and these experiences also cause him to take
artificial problems seriously. These activities teach an attitude of
self-criticalness and competitiveness.

There have been many attempts to compare the mental abilities of various
nations and racial groups by means of performance tests or translated verbal
tests. Differences in average performance are found in most studies, but the
differences are fairly small. In every group tested a large proportion does
better than the average of the white sample used in standardizing the tests.
This is evidence in itself that no one racial group has a monopoly on talent.
Precise comparisons of group averages have no practical importance, but
they might be of great scientific importance if tests were equally fair to all
groups. It is now generally agreed that no universal test for measuring
mental ability can be developed. Any test calls for habits and attitudes which
some cultures favor and other cultures inhibit. The test shows how well
persons tested have developed along those lines, not how they rank on all tasks
or how bright they are innately. 3

49. In what ways, if any, might cultural differences affect performance on each
of the following tests?

a. Formboards (fitting blocks into variously shaped holes)
b. Wechsler Picture Arrangement
c. Porteus mazes

3 Racial comparisons have frequently been misinterpreted because liberal
writers want to prove that there are no innate differences in ability, and
certain conservatives want to prove that nonwhite groups will not profit from
improved educational opportunity. Balanced accounts of the many studies and of
their possible interpretations are given in L. Tyler, 1958, pp. 276-309, and
Anastasi, 1958, pp. 542-575.

Emotional Influences

Performance tests generally demand a longer period of sustained attention
than the shorter items of individual verbal tests. This provides a greater
opportunity for confusion or frustration to build up, and as a result the
performance test is more likely to reflect emotional disturbance. Porteus
(1950) discusses two studies of boys and girls in a reformatory, where
maladjusted delinquents were compared with law-abiding, well-behaved inmates.
In both studies, the adjusted and maladjusted groups had similar Binet IQs,
but on the Porteus maze the maladjusted group dropped about 10 points below
the others. Another study found that group psychotherapy raises the Porteus
MA of schizophrenic patients by two years (H. N. Peters and F. D. Jones,
1951). This suggests that the psychotherapy releases ability previously
suppressed by emotional conflicts.

Practical Correlates 

The fact that performance tests are relatively independent of educational
background raises their validity for some purposes and lowers it for others.
When a tester is trying to predict subsequent educational achievement, the
verbal test is likely to be more informative. Whatever handicaps depress the
verbal score will also interfere with future attainment in most schooling, as
was noted in E. L. Thorndike's comment (p. 189). He went on to voice the
common expectation that a less verbal test would be a better predictor of
practical adjustment. There is some evidence to support this view. In one
study, the adjustment of borderline mental defectives in the community
correlated .77 with their Porteus maze scores, but only .57 with Binet scores
(Porteus, 1939). An earlier study used ratings of the efficiency of children in
a school for the mentally retarded as criteria. There were separate ratings on
“educational efficiency” and “industrial efficiency” (i.e., performance in
occupational training). The Binet predicted the former much better than the
maze (the respective correlations being .81 and .59 for girls). But the Binet
predicted the trade-performance criterion less well (.66 vs. .75) (Berry and
Porteus, 1920). One might expect the Wechsler Verbal score to correlate
higher with school success than the Performance score. The only predictive
correlations reported, however, show nearly equal correlations for the two
scores (.62 and .65 respectively; Frandsen and Higginson, 1951. See also
Mussen et al., 1952). Further investigations are needed to check and explain
this finding.

The performance test has special importance in the clinic. Performance
tests generally depend less on habit and more on ability to attack a new
problem. They therefore are quicker to reflect the adverse effects of
emotional disturbance or brain damage, and perhaps quicker to reveal the
effects of therapeutic treatment. In cross-sectional studies of aging, for
example, nonlanguage tests begin to decline in the 20’s, whereas verbal
ability holds almost constant until the mid-40’s. When radical brain surgery
is performed in an attempt to aid a psychotic, we might expect his
intellectual performance to be affected. His MA on the Binet or Wechsler Full
scale or on a group test will not be changed in any predictable way, the
average change being negligible. His MA on the Porteus mazes will show an
immediate drop of about 2 years, though it will ultimately recover and rise
far above the original level (Mettler, 1949). This change is consistent with
the clinical picture. The patient’s personality shifts from a depressed,
worried state to a carefree, adaptable state in which he gives little thought
to the future. He then gradually stabilizes his behavior in a socially
constructive pattern. The postoperative loss of planning and foresight is
observed in his maze performance, and it of course leads to error. A similar
decline appears in Object Assembly and Digit Symbol. No impairment is found in
the Verbal tests or in Block Design, Picture Completion, and Digit Span.

Since performance tests involve spatial and perceptual abilities which
predict success in certain types of jobs (see Chapter 10), they might have some
significance for vocational guidance. As a factor analysis of the Wechsler to
be reported later shows (p. 264), however, its subscores do not reveal these
separate abilities clearly. Other general performance tests are even less
satisfactory as measures of special ability. Tests which provide purer
measures of specialized aptitudes will generally give better information for
occupational choice.

50. If it were practical to use an individual test for selecting Army officers,
would a verbal or a performance test be preferable?

51. Comment on this statement: "A person's true level of mental ability is
shown by whichever IQ, verbal or performance, is higher."

52. Leona Tyler (1956, p. 10) makes this statement about performance tests and
nonverbal tests: "If they are worth less to us than we expected as substitutes
for the typical verbal intelligence test, they are worth more as supplements."
What evidence justifies this statement? What implications does it have for
planning a testing program?


NOTEWORTHY INDIVIDUAL TESTS 

The Wechsler scale, combining as it does a good performance measure with
a good verbal measure, has almost entirely replaced earlier performance
batteries. Among general-purpose predictors, the Wechsler and the
Stanford-Binet are equally prominent, with no other serious competitor. Our
summary of important individual tests, then, would have few entries if the
criterion for admission were wide use at the present time. Attention needs to
be drawn, however, to some little-used tests of good quality, and to tests
mentioned in the basic research studies of earlier days. Revisions of some of
these tests are being made, and they may again become prominent.

Each listing below gives the title of the test, the authors, the publisher,
and the date of major editions (including always the earliest and latest),
ages or grades for which suited, remarks about nature, purpose, and quality.
In preparing these statements, the writer has relied heavily but not
exclusively on comments made by reviewers for the Buros yearbooks.

• Columbia Mental Maturity Scale; B. Burgemeister, L. H. Blum, Irving
Lorge; World Book, 1953. Ages 3 to 12. Each item consists of three or more
drawings printed on a large card. The child points to the one which does not
belong with the others. Well suited to testing physically handicapped
children. Though the test is brief, reliabilities near .90 are reported.
Correlates about .75 with the Stanford-Binet.

• Draw-a-Man Test; Florence Goodenough; World Book, 1926. Ages 1
to 10. The child is asked to draw the best man he can. Scoring takes into
account the basic structure of the drawing (e.g., are the arms attached to the
trunk?) and details of features, clothing, etc. This is a simple test to
administer, and scoring rules were carefully prepared. Though the Draw-a-Man
can be applied in all cultures, it is dependent on cultural influences. Some
comparable tests (e.g., House-Tree-Person) are used as a technique for
examining personality, rather than as a measure of intellectual development
alone.

• Leiter International Performance Scale; Russell G. Leiter; C. H.
Stoelting, 1936, 1948. Ages 2 to 18. The tasks require perceptual matching,
analogies, memory, and other varied items, many of them similar to verbal
tests. The test is given with very simple directions (spoken or pantomime),
and the items themselves require no language. The test has many excellent
features, being especially suited to handicapped children. The IQ conversions
are of questionable accuracy at preschool levels.

• Merrill-Palmer Scale; Rachel Stutsman; Stoelting, 1931. Ages 2 to 5. A
scale for preschool children using interesting games, puzzles, pictures, etc.
Language questions are simple tests of comprehension ("What cries?").
Some tests involve dexterity (cutting with scissors). Speed is heavily
emphasized. The technical quality and content of the 1931 version compares
unfavorably with the Stanford-Binet.

• Minnesota Preschool Scale; Florence Goodenough, Katharine Maurer,
M. J. van Wagenen; Educational Test Bureau, 1932, 1940. Ages 1½ to 6.
Verbal comprehension and memory tests are used in a verbal score. A nonverbal
scale includes form recognition, tracing, picture completion, block building,
and simple puzzles. Some long-term follow-up studies of predictive validity
have been made. This is an accurate test for ages 3 to 5, but not one with
great appeal for the child.

• Pintner-Paterson Scale of Performance Tests; Rudolf Pintner and D. G.
Paterson; Psychological Corporation, 1927. Ages 4 to 16. This was the first
substantial performance battery. It included object assembly, formboards,
and Healy Picture Completion (pictures to be completed by fitting in
blocks). It played a major part in research and clinical work prior to the
development of the Wechsler scale. Scores depend heavily on speed, and
reliability is unsatisfactory.

• Point Scale of Performance Tests; Grace A. Arthur; Stoelting,
Psychological Corporation, 1925, 1947. Ages 4½ years to adult. One of the best
collections of performance tests standardized on the same sample. Includes
formboards, maze, block design, etc. MAs tend to be lower than SB MAs owing
to defects in the standardization.

• Stanford-Binet Scale; L. M. Terman and Maud A. Merrill; Houghton
Mifflin, 1916, 1937, 1960. Ages years to adult. (See pp. 163 ff.)

• Valentine Intelligence Tests for Children; C. W. Valentine; Methuen,
1945, 1953. Ages 1½ to 15. A British scale combining items from well-tried
sources (Gesell, Burt-Binet, Stanford-Binet, Merrill-Palmer, etc.). Generally
regarded as a superior test for preschool ages, though its standardization is
inadequate.

• Wechsler Intelligence Scales; David Wechsler; Psychological
Corporation, 1940, 1955. WISC, 5-15 years; WAIS, 16 years to adult.
(See pp. 191 ff.)

TESTS OF INFANT DEVELOPMENT 

Tests such as we have discussed to this point set a task for the child and
observe how well he can perform it. Such a method cannot be applied to the
infant, who does not comprehend instructions and has not learned to do
things on command. Tests of early development consist primarily of
observations of the child’s response to stimulation.

The basic aim of most of the scales has been to determine whether the
child is showing the developments normal for his age, rather than to assess
mental level specifically. We cannot test a 1-year-old on abstract thinking;
we don’t even know whether he is doing it at this age. We cannot test the
3-year-old on complex reasoning, because he has not developed adequate
sustained attention and understanding of directions to attempt the problem.
Investigators have concentrated on those aspects of behavior which can be
identified objectively in the young child. Bayley, in her definitive study
(1933), used a composite of 185 items from existing scales, of which the
following are representative. The number in parentheses is the scale
placement, the age in months at which the development is normally found.

(0.6) Lateral head movements, prone.
(1.4) Vertical eye coordination.
(3.0) Reaches for ring.
(3.6) Manipulates table edge.
(5.5) Discriminates strangers.
(5.9) Vocalizes pleasure.
(6.6) Lifts cup by handle.
(8.6) Says da-da or equivalent.
(9.3) Fine prehension.
(9.9) Rings bell purposefully.
(13.5) Makes tower of two cubes.
(16.6) Turns pages.
(20.1) Square or triangle in Gesell formboard, reversed.
(21.5) Names three objects.
(25.0) Understands two prepositions.
(28.4) Picture completion.
(34.6) Copies circle, one success on three trials required.
(35.6) Remembers one of four pictures.
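
A scale placement can be read as a simple lookup: given a child's age in
months, which of the listed developments would normally have appeared? The
sketch below stores a few of the representative items as (placement, item)
pairs; the list name and helper function are illustrative assumptions, and
the placements are entered as decimals of a month as read from the list.

```python
# (scale placement in months, item): a subset of the representative items.
BAYLEY_ITEMS = [
    (0.6, "Lateral head movements, prone"),
    (1.4, "Vertical eye coordination"),
    (3.0, "Reaches for ring"),
    (5.5, "Discriminates strangers"),
    (8.6, "Says da-da or equivalent"),
    (13.5, "Makes tower of two cubes"),
    (25.0, "Understands two prepositions"),
    (35.6, "Remembers one of four pictures"),
]


def items_expected_by(age_months: float) -> list[str]:
    """Items whose scale placement falls at or below the given age."""
    return [item for placement, item in BAYLEY_ITEMS if placement <= age_months]


for item in items_expected_by(6.0):
    print(item)
```

A lookup of this kind only indicates what is typical at an age. It says
nothing about how far a child may depart from the placement and still be
developing normally.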

The heavy emphasis on sensorimotor development in the infant tests
makes it impossible to interpret them as measures of mental ability. As
Boynton comments (Monroe, 1941, p. 629), “When the Linfert-Hierholzer Scale
attempts to measure intelligence in terms of the child’s ability to follow
visually a ball or to use a spoon in eating, or when Charlotte Buhler looks for
intelligence in a child’s smile or in the fact that he seeks a lost toy, it is
apparent that the procedure involves matters which neither the layman nor the
psychologist would regard as integral aspects of intelligence at a later age.”

One way of meeting this objection is to regard the data as meaningful in
their own right, as showing what the infant is doing. Information about the
normal development of coordination, for example, may be important for the
pediatrician who must recognize and diagnose disease, dietary deficiency, or
abnormality. Data about development of sensorimotor behavior may be
important for psychological theory also.

Most investigators, however, have wanted to forecast mental development.
The psychologist dealing with placement of children for adoption,
for example, desires a good early measure of mental ability. A good mental
measure early in infancy might also be of value in identifying certain types
of mental defect which can be overcome by early application of appropriate
drugs. For such applications, the validity of the infant test as a mental test
must be examined.

The correlations in Table 21 (p. 176) show that tests in the first two years,
where the items are predominantly sensorimotor in type, have negligible
correlations with tests at school age. A test at age 2 or 3, however, has fair
ability to forecast school-age intelligence.

The rise in correlation is to some extent a reflection of increase in
reliability with age. Tests for very young children are apt to be unreliable,
in the first place, because the child’s attentiveness and alertness fluctuate.
Even if enough items are used to obtain a good estimate of his status in a
given week, his standard score shifts markedly from month to month. All
infants show unexplained spurts of development. The child may forge ahead
rapidly in locomotor ability and then remain at the same level with no further
change for weeks. Moreover, he may make progress in only one area at a time,
improving his vocabulary while his coordination shows no further advance or
vice versa.
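
The link between reliability and predictive correlation can be made concrete
with the classical relation between observed and error-free correlations.
The sketch below is a general illustration under classical test theory; the
function names and the numerical values are assumptions chosen for the
example, not data from Table 21.

```python
import math


def max_observed_correlation(r_xx: float, r_yy: float) -> float:
    """Ceiling on the correlation between two measures of given reliability."""
    return math.sqrt(r_xx * r_yy)


def corrected_for_attenuation(r_obs: float, r_xx: float, r_yy: float) -> float:
    """Estimated correlation between the error-free scores."""
    return r_obs / math.sqrt(r_xx * r_yy)


# Hypothetical: an infant test with reliability .64, a later test taken as 1.00.
print(round(max_observed_correlation(0.64, 1.00), 2))
print(round(corrected_for_attenuation(0.40, 0.64, 1.00), 2))
```

Under these assumed values no observed correlation could exceed .80, and an
observed .40 would correspond to .50 between error-free scores. Low reliability
thus accounts for part, though not necessarily all, of a low forecast.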

Accuracy of measurement can be increased by using more items, or by
combining estimates made in successive months. J. E. Anderson (1939,
p. 376) suggests other precautions to increase the dependability of the
measures:

The earlier . . . the measurements are made, the less reliance can be
placed on a single measurement or observation, if that measurement or
observation is used for predicting subsequent development.

The earlier . . . the measurements are made, the greater care should
be taken to secure accuracy of observation and record and to follow
standardized procedures.

The earlier . . . the measurements are made, the more account
should be taken of the possibility of disturbing factors, such as
negativism and refusals, that operate as constant errors to reduce score.

Since development is a timed series of relations or sequences, there
are for many functions periods below which only a small portion of the
function can be measured and above which a progressively larger portion
can be measured. Hence, the possibilities of prediction are limited
and progression with age is not an infallible indicator of the value of a
measurement. Every effort should be expended to secure the most
accurate and predictive tests by standardizing tests against multiple
[criteria, particularly measures of ability in later life] rather than
against single criteria.
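
The earlier remark that accuracy rises when more items are used, or when
estimates from successive months are combined, corresponds to the
Spearman-Brown relation for a lengthened test. The sketch below assumes the
added items or occasions are parallel measures, which the month-to-month
shifts in infant scores make doubtful; the function name and the starting
reliability are illustrative.

```python
def spearman_brown(r_single: float, k: float) -> float:
    """Reliability of a composite of k parallel parts, each of reliability r_single."""
    return k * r_single / (1.0 + (k - 1.0) * r_single)


# Hypothetical single testing with reliability .50, combined over 2 and 4 occasions.
for k in (1, 2, 4):
    print(k, round(spearman_brown(0.50, k), 2))
```

With the assumed starting value of .50, two combined testings give about .67
and four give .80. The gain from each added testing grows smaller as the
composite lengthens.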

This last point is at the heart of the difficulty. If the functions that
constitute intelligence cannot be observed early in the child's life,
substituting a measure of nonintellectual functions is no solution. We must
wait until purposeful problem solving is present; more than that, we must wait
until these types of behavior are reasonably well stabilized, since measures
made while a type of behavior is just emerging are notoriously unreliable.
This conclusion is supported by the results of Maurer, cited earlier, that
observations of undirected behavior are not predictive of later IQ, but
suitably chosen task performance is. Since tasks cannot be set for the child
much below the age of 2, there is little hope of predicting the IQ from tests
in infancy.

A much more optimistic but still cautious position on infant testing is taken
by Escalona (1950). She regards the test as an opportunity to observe the
functioning of the whole organism in "a situation which has structural and
dynamic properties"; i.e., she turns away from an attempt at exact
measurement of a single attribute and uses the test to enrich an
impressionistic observation. Her position is influenced both by the field
theory of Kurt Lewin and by the psychoanalytic approach to test interpretation
represented by Rapaport and Schafer. She considers the child’s total social
response, management of his body, and attention pattern in Gesell’s tests, and
tries to judge his development qualitatively. One case she describes as
follows:

An infant was first seen at the age of three months. At that time he gave every
evidence of making unusually good developmental progress, earning test scores
which placed him in the accelerated range. He was characterized by a very high
activity level; bodily activity increased markedly in response to all
stimulation. His capacity to tolerate delay or frustration seemed lower than
that of most infants of the same age. At the three months age level, test
performance reflects primarily gross motor coordination, vigorous
responsiveness to stimulation and perceptual discrimination. At a later age,
however, tests are designed so as to also elicit fine motor coordination which
requires inhibition of impulse, as well as problem solving behavior which
implies delay in attaining a goal.

A prediction was made that the child tended toward immediate discharge
of tension, would probably find tasks calling for inhibition of impulses
frustrating, and would be likely to earn only an average IQ on the later tests.

The child was retested at 9 months and at 22 months. On both occasions
he was again noted to be a more than ordinarily active child. His total IQ
dropped from the superior to the average range. Items requiring fine motor
coordination and those requiring a playful and indirect approach to a goal
were passed at a low level or were refused altogether. Verbal items and those
requiring immediate grasp of a problem, however, were performed at high
average and superior levels. Gross motor coordination remained outstandingly
good.

In many instances the tester can judge whether a child has performed at
his best. Escalona divides children into two categories in this respect. Those
whose tests are judged "optimum" change their standing very little on tests a
year or so later. No prediction can be made for those whose tests are judged
nonoptimal. There are many large changes in this group.

Here we have a characteristic contrast between the psychometric and the
impressionistic approach to testing. The psychometric criteria applied by
Bayley indicate almost no predictive value in infant tests. Escalona's clinical
method is said to give not only a statement of developmental level which is
predictive in many cases but also a qualitative description of strong and
weak points. There is no reason to question the correctness of either
viewpoint. It is obvious, however, that not just any impressionistic
interpretation can be depended upon. Until Escalona communicates the
observational cues and the theory she uses to interpret tests, and until the
various interpretations have been checked by systematic research, her method
cannot be used by others.

53. Characterize the abilities used within each year level of Bayley's scale
(assuming that the items listed are characteristic).

54. How do the abilities tested in her scale differ from those tested in the
Stanford-Binet?

55. To what extent would differences in experience give some children an
advantage on the tasks listed?

Important Infant Tests 

• California First-Year Mental Scale, Nancy Bayley, University of California
Press, 1933. Ages 1 to 18 months. A set of items chosen from other scales
and standardized by retesting the same group of about fifty infants
repeatedly. More data are available on this scale, taken as a whole, than on
other infant tests.

• Cattell Infant Intelligence Scale, Psyche Cattell, Psychological
Corporation, 1947. Ages 2 to 30 months. This is an attempt to extend the
Stanford-Binet downward. Items are somewhat more complex than those in other
schedules, but at the youngest ages simple perceptual responses (e.g.,
looking at a moving person) are counted. The test at age 1 correlates as high
as .56 with SB IQ at age 3 but has very low correlations with school-age IQ
(Cavanaugh et al., 1957). It has no predictive value before the first
birthday.

• Gesell Developmental Schedules, Arnold Gesell and others, Psychological
Corporation, 1925, 1949. Ages 4 weeks to 6 years. A schedule of behaviors
divided into four areas: motor, adaptive, language, and personal-social. The
child is stimulated, e.g., by placing a block in front of him, and his
reactions are compared with expectations for his age. The standardization,
reliability, and interpretation of scores are open to question. As Anastasi
says (1954, p. 283), "These schedules may be regarded as a refinement and
elaboration of the qualitative observations routinely made by pediatricians."

• Griffiths Mental Development Scale, Ruth Griffiths, University of London
Press, 1954. Ages 0-2 years. Five carefully prepared scales, using original
items together with those of Gesell and others, measure locomotor,
personal-social, hearing-speech, hand-eye, and performance developments. The
total scale includes 260 items and permits more reliable measurement than any
other instrument. Retest reliability with more than a six-month interval is
.87 (Griffiths, 1954). Too little research is available to evaluate the scale
at present.

Suggested Readings

Anastasi, Anne. Race differences: methodological problems. Differential
psychology. (3rd ed.) New York: Macmillan, 1958. Pp. 542-575.
An authoritative discussion of the proper interpretation of studies comparing
test scores of racial groups includes some representative findings. The
subsequent chapters review several major investigations, particularly of
differences between Negroes and whites.

Brown, Elinor W. Observing behavior during the intelligence test. In Eugene
Lerner & Lois B. Murphy (eds.), Methods for the study of personality in young
children. Monogr. Soc. Res. Child Developm., 1941, 6 (4), 268-283.
Responses of two 4-year-olds to the Stanford-Binet are presented to show that
performance depends on personality and response to the examiner, as well as
on intellect. Students should read this protocol if they have not seen a
demonstration of individual mental testing.

Richards, T. W. Mental test performance as a reflection of the child's current
life situation: a methodological study. Child Developm., 1951, 22, 221-233.
(Reprinted in Eugene L. Hartley & Ruth E. Hartley, eds., Outside readings in
psychology, 2nd ed. New York: Crowell, 1958. Pp. 260-273.)
A child's Binet performance from age 3 to age 10 fluctuated from IQ 115 to
IQ 140. Richards traces observation records, parent attitudes, and personality
tests to show a correspondence between test changes and changes in the
pressures and satisfactions in the child's life.

Schofield, William. Critique of scatter and profile analysis of psychometric
data. J. clin. Psychol., 1952, 8, 16-22.
Schofield reviews the studies claiming to find information in Wechsler profile
shape that can be used for clinical diagnosis. Wishful thinking, accompanied
by inadequate research design, is blamed for the widespread and unjustified
faith in profile interpretation. The faults in this research should be noted
in planning any validation study.

Terman, Lewis M. The discovery and encouragement of exceptional talent. Amer.
Psychologist, 1954, 9, 221-230. (Reprinted in Don E. Dulany, Jr., & others,
eds., Contributions to modern psychology. New York: Oxford University Press,
1958. Pp. 51-65. Also in H. H. Remmers & others, eds., Growth, teaching, and
learning. New York: Harper, 1957. Pp. 63-77.)
This lecture surveys some of the principal American work with mental tests,
including Terman's follow-up of exceptional children. Terman reviews the
childhood differences between those who succeeded in later life and those
whose careers were mediocre, emphasizing the cultural factors that bring
talent to fruition.


Group Tests of General Ability


GROUP tests are used far more extensively than individual tests because of
their economy and practicality. Particularly in dealing with masses of
subjects, whether in the Army, industry, schools, or research, group tests are
indispensable. The better group tests are as reliable as comparable individual
tests, and for many objectives they have equally good predictive validity.
Moreover, they do not require specially trained testers.

The group test is based on the assumption that subjects understand the
nature and purpose of testing, and that each wants to do his best. Wherever
these ideal conditions are not met, the scores of some individuals will be
invalid. The individual test gives the examiner a good chance to note that
the subject is ill, unduly tense, or confused by the directions, and thus the
experienced examiner can recognize when the score is invalid. In the group
test, a standard procedure is applied, and no special consideration can be
given to individuals. The group test is the practical solution to the problem
of obtaining information when large numbers of individuals must be considered
at once, for example, in classifying recruits or identifying pupils who cannot
keep up with the normal pace. Wherever the important objective is to make
decisions which are correct on the average, the group test is suitable.
Wherever the primary consideration is thorough understanding of the
individual, the flexibility and intimacy of the individual test make it
much more satisfactory. In schools, group tests are often used as a
preliminary device to identify pupils to be studied individually.

REPRESENTATIVE INSTRUMENTS 

Most of the early group tests were based on the "omnibus" or hodgepodge
principle of the Binet scale. The test mixed a great variety of problems so
that specialized abilities called for by certain questions (e.g., arithmetic)
had very little influence compared to the general ability required by all the
problems. As the makers of the famous Army Alpha Examination put it, the
ideal was to find tests all related to the criterion and having very little
relation to each other. The omnibus test with items in haphazard order or with
many short subtests was the most common type of group test until the 1940's.
Such a test was used to obtain just one score, a measure of general ability.
Many recent tests are designed so that sections can be scored and interpreted
separately.
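
The Army Alpha ideal quoted above (subtests all related to the criterion but
little related to each other) can be illustrated with a small calculation.
The sketch below is not taken from the Army Alpha work; it assumes a
simplified case in which every subtest is standardized, every subtest has the
same correlation with the criterion, and every pair of subtests has the same
intercorrelation. The function name and the numerical values are hypothetical.

```python
import math

def composite_validity(k, r_criterion, r_between):
    """Correlation between an equally weighted sum of k standardized
    subtests and a criterion, assuming each subtest correlates
    r_criterion with the criterion and each pair of subtests
    correlates r_between (a simplifying assumption)."""
    covariance = k * r_criterion                       # cov(sum, criterion)
    composite_sd = math.sqrt(k + k * (k - 1) * r_between)  # sd of the sum
    return covariance / composite_sd

# Hypothetical values: five subtests, each correlating .40 with the criterion.
low_overlap = composite_validity(5, 0.40, 0.20)
high_overlap = composite_validity(5, 0.40, 0.60)
print(round(low_overlap, 2), round(high_overlap, 2))
```

Under these assumed values the composite of weakly intercorrelated subtests
predicts the criterion better than the composite of highly intercorrelated
ones, which is the point of the omnibus design.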

Instead of using the omnibus test where specific abilities tend to cancel
each other, some British workers limited their group tests to items thought to
be pure measures of general ability. Charles Spearman was the leading spirit
in an attempt to isolate the essence of general ability by finding items which
measured general ability and nothing else. In the course of this work he
invented factor analysis, a statistical method which plays a large part in
current test development (see Chapter 9). We need not at this point elaborate
on Spearman's method beyond saying that, in effect, he looked for items
which correlate with all other types of mental-test items. The best measures
of "g" or general ability, according to Spearman's research, were abstract
reasoning problems. Like Binet before him, Spearman studied his items in
an attempt to formulate a definition of what his test measured. He concluded
that g consists of facility in "apprehension of one's own experience,
the eduction of relations, and the eduction of correlates"; i.e., in making
observations and extracting general principles.

A Homogeneous Test: Matrices 

The matrix item is the single most popular technique for measuring mental
ability, although it is better known abroad than in the United States. This
item was invented by L. S. Penrose and J. C. Raven in England and published
as Raven's Progressive Matrices Test in 1938. Raven, following Spearman's
theory, desired to measure the ability to perceive relationships. The
matrix item is a "two-dimensional" analogies problem, as illustrated in
Figure 35. The subject is directed only to select the design that completes
the pattern. The figures are altered from left to right according to one
principle, from top to bottom by another. The subject must identify these
principles and apply them to determine the needed design.

The matrix principle is highly flexible. The possible range of difficulty is
enormous, as can be seen in the examples given. The test may be administered
individually or in groups, and may be speeded or given with liberal
time allowance. For testing less mature subjects, the items can be presented
as a series of formboards where the subject actually chooses a block and fits
it into the blank space. The directions are very simple, so that verbal
understanding plays little part. Indeed, with very easy initial items, the
test can be administered in pantomime so that the verbal element is entirely
eliminated.

No one form of the matrix test is used widely. Since items are rather easy
to prepare, psychologists in all parts of the world have developed tests of
their own. These versions can be used for hiring employees or selecting
students for special courses, with little fear that the items will become
known in advance. The disadvantage of this variation is the lack of
comprehensive data on any one form. The published version of the Progressive
Matrices is available in the United States, but its standardization is poor.
The norms for the group test are based on 1407 children (in half-year age
groups) and 3665 militiamen and 2192 civilians (in five-year groups). These
are English cases, and no further description is given. For general clinical
and educational testing, inability to compare a given case to acceptable
American norms is a serious drawback. For selection within a group of job
applicants, of course, norms are of very little importance.

FIG. 35. Matrix items at three levels of difficulty. The first and second
items are like those in the Progressive Matrices, being of the usual
difficulty. The third item is a very difficult matrix in free-response form,
designed for testing college graduates.

The matrix test has a sizable correlation with individual tests. For
children, the correlation with SB is about .60 (Keir, 1949). For an adult
sample, Wechsler Performance correlates .70 and Wechsler Verbal .58 with
Matrices (J. Hall, 1957; see also Martin and Weichers, 1954). The subtest
having the highest correlation with Matrices is Block Design. Obviously, the
matrix items are relatively independent of the educational attainments which
affect the Binet and the Verbal score, though Ombredane et al. (1956), in
studying underdeveloped African tribes, found that test scores were affected
by level of education. Raven suggests that in practical testing the matrix
measure of ability to solve new problems be supplemented by a measure of past
attainment, such as vocabulary. This is especially important for persons
whose efficiency is impaired by old age, emotional disturbance, etc.

The Raven Matrices were adopted as the principal test for military
classification in Great Britain during World War II. This nonverbal test was
chosen to make sure that normally intelligent recruits were not rejected
because of poor education. The fact that the matrix is so nearly a pure test
of one ability limited its military usefulness. Tests combining general,
verbal, and numerical abilities proved to be better predictors of performance
in training courses. Specialized spatial-mechanical tests such as the Bennett
generally made a better contribution to prediction of success in mechanical
jobs. The matrix test was most helpful in predicting performance in visual
signalling and radar operating (Vernon and Parry, 1949, pp. 235, 244).
Experience such as this has led the practical tester to give more attention to
the specialized abilities than formerly. A relatively pure measure of g is not
usually as good a predictor as a composite of g with verbal, spatial, or the
other abilities required by the course or job to which the person is assigned.
One of the reasons Binet's test was highly successful was that it called for
the specialized abilities in about the same combination that schooling itself
did. It is therefore a better predictor of school adjustment than a purer test
might be.

The purely nonverbal score, however, has one special function in school
testing. It calls attention to pupils who have good reasoning ability but who
are below standard in reading and verbal development. Such cases are
obscured by a test that mixes verbal and nonverbal components together, and
thus the school overlooks children who could do much better work if given
suitable help. The nonverbal test is also useful in employee selection where
range of educational background is wide. Among African tribesmen trained
to operate heavy mining machinery, the matrix test predicted performance
ratings with validity .51. This coefficient was based on performance after
two practice tests, which proved more valid than measurement without
practice (Ombredane et al., 1956).

An Omnibus Test: Kuhlmann-Anderson 

American group tests have been influenced little by theories about
intelligence. Instead, they have been developed pragmatically, by trying items
and retaining those which correlate with such criteria as school success or
job success.

One of the important American group tests is the Kuhlmann-Anderson
Intelligence Test series. Most group tests are printed in a single booklet,
with a different booklet for each three-grade range from kindergarten to
adulthood. The Kuhlmann-Anderson, however, has nine different booklets, so
that a class can be given a version of the test closely fitted to its ability.
The pupils who do well on this booklet may then be given the next higher test,
and those who do badly can be given the next easier test to obtain a more
accurate measurement. Under this plan few pupils encounter items where they
have to guess, and the test is shorter because unnecessary easy items are
eliminated.

The development of this scale is characteristic of the procedures used in
the older group tests of general ability. Beginning in 1916, Kuhlmann began
a tryout of items for use in state institutions in Minnesota, and in 1919
began (with Dr. Rose G. Anderson) to prepare formal tests. In the next four
years more than 100 varieties of items were tried out; 51 seemed promising
enough for further use. Four more years of research led to a selection of 35
subtests for the published scale. The scale then passed through five further
editions, sometimes with minor changes of content to replace unsatisfactory
tests or to extend the range, sometimes with modification of norms or format.
Authors of present-day tests often employ a similar procedure but use the
experience of previous investigators to shorten the research.

The scale now contains 39 tests, organized in booklets which partly overlap
each other. Thus booklet K (kindergarten) includes tests 1-10; A (Grade 1),
tests 4-13; B (Grade 2), tests 8-17; and so on up to booklet G for Grades
7-8 and booklet H for Grades 9-12. Figure 36 shows representative test items.
Many of these item types are used in other group tests. Each test consists of
at least eight items. A time limit is set either for each item or for the
subtest. These limits are liberal in the first two booklets, but later levels
introduce a substantial degree of speeding.

Nearly all of the subtests require adaptation to new situations, yet they
also depend on experience and many of them involve special abilities (verbal,
spatial, etc.). Kuhlmann and Anderson followed Binet's principle of
combining such a great variety of tests that no one specialized ability plays a
large part in the score. Verbal ability is important because the pupil must
comprehend directions, but the test designers use simple vocabulary,
introduce reading only in the later tests, and even at advanced levels use only
short and familiar words. An example of ingenious testing is test 8, Counting,
which measures judgment and accuracy with numbers without using the formal
number system or other school learning.

Test 2. Picture errors. "Put a dot on the part that is wrong."
Test 5. Pattern completion. "Put in the stick that is left out of the
second figure."
Test 8. Counting. "Put as many dots in the box as there are balls."
Test 10. Copying. (The square with lines is held up before the class
for 10 seconds.)
Test 21. Classification. "Find the one that does not belong with the
others." Items: top, rattle, doll, sled, playing.
Test 23. Opposites. "Find the two opposites." Items: old, rich, wide,
poor, green, full.
Test 26. Similarities. "Find the three things which are alike." Items:
robin, winter, horse, song, squirrel, fence.
Test 28. Anagrams. "You are given the first letter. Write the rest of
the word." Item: N-B-U-M-E-R; N_
Test 32. Arrangement. "If these were arranged in order, which would be
the middle one?" Items: inaudible, distinct, deafening, faint, loud.
Test 34. Directions. "If the word contains E but not R nor T, write 3
after it." Items: Basket, Picture.
Test 39. Number series. "Write the two numbers which should come next."
Item: 5 6 8 11 15

FIG. 36. Representative Kuhlmann-Anderson items. The directions used in
testing are much simpler and more complete than these abbreviated quotations
suggest. (Copyright 1952, Personnel Press, Inc. Reproduced by permission.)



Subtests which have substantial correlation with age were selected in
preference to subtests with low correlations. A second criterion was the
correlation of subtests with each other. Those having low correlations with
other subtests were preferred, so as to increase the comprehensiveness of the
test by bringing in many aspects of ability. This, of course, is just the
opposite of the procedure used in constructing a homogeneous test of a
narrowly defined ability. The Kuhlmann-Anderson measures substantially the
same thing as the Stanford-Binet. When errors of measurement are reduced by
averaging three trials of each test, SB IQ and Kuhlmann-Anderson IQ correlate
almost perfectly (Dearborn and Rothney, 1941).

1. What abilities does the Kuhlmann-Anderson test require that the Raven Matrices 
do not? 

2. The Kuhlmann-Anderson norms are based on "15,000 cases from representative
Minnesota, New York, New Jersey, and Pennsylvania communities . . . selected
in consultation with State Departments of Education." What further information
about the norm group would be desirable? How satisfactory is this sample?

3. In the normative sample, the standard deviation of the Kuhlmann IQ was about 
11 points at ages 6-10. Why is this fact important to the test user? 

4. The Kuhlmann-Anderson correlates higher with present achievement than does
the Stanford-Binet. Why? Is this an advantage or a disadvantage?

5. Is it possible for a person to do consistently better on group mental tests than on 
individual tests? 

PROBLEMS OF DESIGN AND VALIDITY 
Dependence on Language 

The matrix test is entirely free from verbal content, and the
Kuhlmann-Anderson uses predominantly nonverbal items. Many other popular
tests, however, are almost completely verbal. Vocabulary and verbal reasoning
have always been found good predictors of school and college success, and
arithmetic reasoning and number series are also popular item forms.


TABLE 24. Approximate Level of Reading Ability Required to
Comprehend Items of Group Mental Tests (a)

Test                                        First Items    Last Items

Kuhlmann-Anderson (booklets for Grade 6)        5.0            7.5
Otis (Higher, Form C)                           7.0           13.0
Terman-McNemar
  a. Information                                5.5           10.0
  b. Logical selection                          9.0           17.0
  c. Analogies                                  6.0            7.0
  d. Best answers                               9.0            9.0
Henmon-Nelson (Form A, Grades 7-12,
  nonarithmetic items)                          6.5           14.0

(a) The reading level is estimated by determining readability from sentence
length. A reading level of 7.0 is the performance of the average
seventh-grader.
SOURCE: Johnson and Bond, 1950.


A test which is strictly verbal will be much influenced by the person's
reading ability and familiarity with the language. While it may permit a useful
prediction of his school success, where language is continually important, it
would be wrong to interpret his low score as a sign of lack of mental ability.
To avoid such errors, most testers prefer to use a test which includes both
verbal and nonverbal items. Sometimes the two types are included in a single
score, as in the Kuhlmann-Anderson. Sometimes separate scores are obtained,
as in the Raven Matrices and the vocabulary test recommended to
accompany it.

It is wrong to assume that a test which requires no reading is independent
of language abilities. The directions are almost always verbal, and not always
easy to comprehend. Sometimes the solution to a problem such as figure
analogies or matrices requires complex symbolic reasoning with abstract
concepts. The person almost certainly relies on his vocabulary to arrive at
the answer. A person whose language experience has been limited (e.g., a
deaf child or a bilingual) is likely to be handicapped on some of the
so-called nonlanguage tests.

Comparability of Scores 

Each test maker develops norms for his tests using his own standardizing
group. It is particularly important to realize that an IQ of a given size has a
different meaning in different tests, or on the same test at different ages. In
a recent study over 2200 9- and 10-year-olds took four prominent group tests.
Where the Stanford-Binet distribution indicates that about 220 should have
IQs 120 and over, the Kuhlmann-Anderson showed such IQs for 137 children,
and the Henmon-Nelson for 524 children. At the low end of the scale,
where 220 are expected to fall below IQ 80, Kuhlmann-Anderson reports 53
and Henmon-Nelson reports 119 such cases (Eells et al., 1951).
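
The dependence of such counts on the spread of a test's IQ distribution can
be sketched numerically. The example below is only an illustration: it
assumes IQs are normally distributed with mean 100, takes a standard
deviation of 16 for the Stanford-Binet as an assumption not stated in this
passage, and uses other hypothetical standard deviations for comparison. The
function names are invented for the sketch.

```python
import math

def proportion_at_or_above(cutoff, mean=100.0, sd=16.0):
    """Upper-tail area of a normal distribution at the given IQ cutoff."""
    z = (cutoff - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def expected_count(n, cutoff, mean=100.0, sd=16.0):
    """Number of cases out of n expected at or above the cutoff."""
    return n * proportion_at_or_above(cutoff, mean, sd)

# Hypothetical comparison: same mean, different spreads, 2200 pupils.
for sd in (11.0, 16.0, 20.0):
    print(sd, round(expected_count(2200, 120, sd=sd)))
```

With the assumed standard deviation of 16 the sketch gives roughly 232 of
2200 pupils at IQ 120 or above, close to the "about 220" in the text; a
narrower or wider spread changes the count sharply even though the cutoff is
the same.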

In another study, Lennon compared three tests on equivalent samples and
determined what raw scores on the three tests were comparable (Lennon,
1952). These scores were then converted to IQs and MAs, using the tables
from the test manuals. He found, for example, that an IQ of 130 on the
Terman-McNemar is earned by the same pupils who would earn 123 on Otis
Gamma and 126 on the Pintner Verbal. An MA of 14 on the Terman corresponds
to 12-9 on Otis and 13-6 on Pintner. Obviously differences in standardizing
samples cause IQs on some tests to average higher or to spread out
more than on others. Another source of variation is the use of now-obsolete
statistical techniques intended to yield ratio IQs. As all testers shift to the
use of standard scores or percentiles within age groups, comparability of tests
will depend wholly on the adequacy of norming.
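
To see how much the comparable mental ages reported by Lennon would matter,
the sketch below converts them to ratio IQs with the classical formula
IQ = 100 x MA / CA. The chronological age of 11-0 is a hypothetical value
chosen for illustration, and the helper names are invented; the test manuals
themselves may have used different conversion tables.

```python
def to_months(age):
    """Convert an age written as 'years-months' (e.g. '12-9') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)

def ratio_iq(mental_age, chronological_age):
    """Classical ratio IQ: 100 * MA / CA, both given as 'years-months'."""
    return 100.0 * to_months(mental_age) / to_months(chronological_age)

# Lennon's comparable MAs, applied to a hypothetical pupil aged 11-0.
for test, ma in (("Terman-McNemar", "14-0"), ("Otis", "12-9"), ("Pintner", "13-6")):
    print(test, round(ratio_iq(ma, "11-0")))
```

For this hypothetical pupil the same performance would be reported as a
ratio IQ of about 127, 116, or 123 depending on which test's mental-age
table was used.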

Degree of Speeding 

Most group tests of ability are given with a time limit. Whether an ability
test should be speeded is arguable. The time allowed for an ability test may
be so short that standings are determined almost entirely by speed of work,
or may be so liberal that everyone finishes. Most tests present items in order
of difficulty so that each student encounters the items he can do, and only
the best student is pinched by the time limit. Table 25 indicates the effect
on score when pupils are given added time on three typical tests. Evidently,
most pupils finish in the standard time all the items they can do. For
occasional cases, of course, speed will still be the principal factor
determining scores.


TABLE 25. Effect of Giving Pupils Additional Time on Group Intelligence Tests

                                                       Mean Points   Mean Points
Age of   Number of                Standard   Extra     Earned in     Earned in
Pupils   Pupils     Test          Time       Time      Standard      Additional
                                                       Time          Time

9-10     223        Otis Alpha    20 min     30 min    65.0          1.1
                    Non-Verbal
9-10     226        Henmon-       30 min     20 min    54.1          3.4
                    Nelson
13-14    235        Otis Beta     30 min     15 min    60.4          0.9
                    (verbal)

SOURCE: Eells, 1948.
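
The size of these gains is easier to judge as a percentage of the
standard-time score. The sketch below does this arithmetic, reading the
table's standard-time means as 65.0, 54.1, and 60.4 points and the
additional-time gains as 1.1, 3.4, and 0.9 points; the decimal points are a
reading of the printed table, and the variable names are invented for the
example.

```python
# (test, mean points in standard time, mean points gained in extra time)
table_25 = [
    ("Otis Alpha Non-Verbal", 65.0, 1.1),
    ("Henmon-Nelson", 54.1, 3.4),
    ("Otis Beta (verbal)", 60.4, 0.9),
]

def percent_gain(standard_mean, extra_points):
    """Extra points expressed as a percentage of the standard-time mean."""
    return 100.0 * extra_points / standard_mean

gains = {test: percent_gain(std, extra) for test, std, extra in table_25}
for test, gain in gains.items():
    print(f"{test}: {gain:.1f} percent")
```

On this reading the added time raises the mean score by roughly 2 percent
on the two Otis tests and about 6 percent on the Henmon-Nelson, which is
consistent with the statement that most pupils finish what they can do in
the standard time.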


Formerly a distinction was made between "speed tests" (time-limit tests)
and "power tests" (work-limit tests). A test with a time limit, however, does
not necessarily depend on speed. To decide whether a time-limit score
depends on speed, we would need a special experiment. We would first give
the test in the usual manner, obtaining score x, and then allow enough time
for everyone to finish, obtaining the unspeeded score y. If most persons have
the same relative standing on x and on y, the added time made little
difference and the time-limit score depends on the same abilities as the
untimed score (Helmstadter and Ortmeyer, 1953; Cronbach and Warrington, 1951).
In the Kuhlmann-Anderson manual an experiment is reported in which children
were allowed to complete the test after time had been called, using a
second color of pencil so that both timed and untimed scores were available.
The two sets of scores correlated as follows: in Grade 3, .74; Grade 5, .83;
Grade 7, .87; Grade 9, .93. Perhaps it appears that these correlations are so
high that the two tests measure the same thing, but making allowance for
the nonindependence of the two measures, we estimate that in Grade 5, for
example, at least 31 percent of the test variance is due to speed.

We cannot tell whether this is an advantage or a disadvantage in prediction
without knowing whether the criterion task calls for speeded performance,
and what type of speed it calls for. When the criterion task does not
demand speed or demands a type of speed not involved in the test, speeding
the test introduces an irrelevant variable. For general academic criteria,
a measure of power independent of speed is more relevant than a speeded
score. With a long testing time, an unspeeded test is more valid than a speed
test covering the same material. If only a short time is available for testing,
however, a speeded test will be more reliable than an unspeeded test
containing very few items. As a result, the short speeded test has greater
predictive validity than the even briefer test that everyone can finish in the
same time (F. M. Lord, 1953).
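
The passage does not spell out how the 31 percent figure for Grade 5 was
obtained. One reading that reproduces it is to treat 1 - r squared, the
proportion of timed-score variance not shared with the untimed score, as the
share attributable to speed. The sketch below applies that assumed reading
to the four timed-untimed correlations reported from the Kuhlmann-Anderson
manual; the names are invented for the example.

```python
# Timed vs. untimed correlations by grade, as reported in the text.
timed_untimed_r = {3: 0.74, 5: 0.83, 7: 0.87, 9: 0.93}

def unshared_variance(r):
    """Proportion of variance in one score not shared with the other."""
    return 1.0 - r ** 2

speed_share = {grade: unshared_variance(r) for grade, r in timed_untimed_r.items()}
for grade, share in speed_share.items():
    print(f"Grade {grade}: about {100 * share:.0f} percent not shared")
```

Under this assumption the speed component shrinks from about 45 percent in
Grade 3 to about 14 percent in Grade 9, with the Grade 5 value matching the
31 percent cited in the text.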

The trend in recent American tests is to provide ample time for nearly
everyone to finish. This point of view is not universally accepted. Eysenck
(1953) and Furneaux, in England, argue that the speed with which the
mind produces hypotheses is the essence of good problem solving, and that
a speeded test is therefore the best measure of mental ability.


Stability 

Test scores are unstable when behavior patterns are being acquired, and
we would expect a pencil-and-paper test score to be unstable in the earliest
school years. The reliability of the Detroit First-Grade Intelligence Test is
.91 by a split-half method, indicating good accuracy, but is only .76 when
a retest after four months is given. Seagoe (1934) gives data on repeated
measurements of various groups after a two-year interval, as follows:

Detroit First Grade at age 6-4 with Detroit Primary at 8-8     r = .64
Detroit First Grade at age 6-3 with Haggerty Delta at 8-8      r = .66
Detroit Primary at age 8-9 with National Form B at 10-8        r = .73
National Form A at 10-4 with Terman at 12-5                    r = .80
National Form B at 10-7 with Terman at 12-6                    r = .87

Predictions of intellectual performance over short intervals of time can be
made with substantial accuracy, but the mental test permits only approximate
long-range predictions in the lower grades of school. Allen (1944b)
reports, for example, that the Kuhlmann-Anderson IQ in the middle of
Grade 1 predicts achievement early in Grade 4 with a validity of .52.

Once the initial development of reading and seatwork is past, group tests
for successive ages measure about the same thing and do so with considerable
stability, as Seagoe’s data show. By adolescence, scores appear to be
extremely stable. H. E. Jones reports that scores at age 17 on the Terman
Group Test correlate .84-.90 with retests at age 33 (J. E. Anderson, 1956,
p. 159). Despite this stability, the tester should not rely on an old
mental-test score when a critical decision is being made. Some young people
make substantial changes in mental performance over a three-year period.

Overlap with Achievement Tests 

The Kuhlmann-Anderson, like most group tests of general ability, is closely
related to educational status. According to the test manual the test
distinguishes sharply between accelerated and retarded children in the same
grade. The concurrent correlation of the test with total score on an
educational achievement battery, with age held constant, is .84 in one study,
.77 in another (Hilden and Skeels, 1935; Allen, 1944a). This correlation is
higher than that of the Stanford-Binet with achievement.

Some experts argue that it is impossible to separate aptitude or intelligence
from achievement. If intelligence and achievement tests measure the
same things, we are only fooling ourselves by giving them different labels.
We can examine this criticism by making use of two statistical principles:
reliability is the proportion of test variance that is nonerror variance; the
validity coefficient squared indicates what proportion of test variance
measures the same attribute as the criterion. If the reliability is .86¹ for
the Kuhlmann-Anderson, and the intercorrelation of test with achievement is
.84, we arrive at Figure 37. The square of the intercorrelation indicates the
overlap in variance. Seventy-one percent of the test variance represents what
the achievement test measures; 14 percent is error. Therefore, 15 percent of
the Kuhlmann-Anderson test variance is due to some reliably measured ability
independent of achievement. The test would report some differences among
children having identical school achievement, but among children with the
same achievement about half of the individual differences in IQ are due only
to random errors. A similar conclusion would hold for other heterogeneous
group tests of mental ability (Coleman and Cureton, 1954). For most children,
a group mental test leads to the same prediction that a comprehensive
achievement test would.

FIG. 37. Overlap of the Kuhlmann-Anderson test with an achievement measure.

¹ A coefficient of equivalence computed from subtest intercorrelations for
Grade 5, by a modified analysis-of-variance method. The coefficient reported
in the manual, .94, is based on a split-half technique which should not be
applied to speeded tests.
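
The variance argument above can be checked with a few lines of arithmetic.
The Python sketch below is an added illustration, not part of the original
analysis; the function name variance_breakdown and its dictionary keys are
hypothetical. It takes the reliability (.86) and the test-achievement
correlation (.84) quoted in the text and splits total test variance into the
portion shared with achievement, the error portion, and the reliable
remainder.

```python
def variance_breakdown(reliability: float, validity: float) -> dict:
    """Split total test variance (taken as 1.0) into three proportions.

    reliability: proportion of test variance that is nonerror variance
    validity:    correlation of the test with the criterion measure
    """
    shared = validity ** 2          # variance overlapping the criterion
    error = 1.0 - reliability       # variance due to random error
    unique = reliability - shared   # reliable variance not shared with criterion
    if unique < 0:
        raise ValueError("squared validity cannot exceed reliability")
    return {"shared": shared, "error": error, "unique": unique}


# Figures quoted in the text for the Kuhlmann-Anderson test.
parts = variance_breakdown(reliability=0.86, validity=0.84)

# Among children with identical achievement, only the error and unique
# portions remain; this is the fraction of that remainder due to error.
error_share = parts["error"] / (parts["error"] + parts["unique"])

for name, value in parts.items():
    print(f"{name:>6}: {value:.2f}")
print(f"error share at fixed achievement: {error_share:.2f}")
```

Rounded to two places, the three proportions come out at .71, .14, and .15,
and the error share at fixed achievement is about .48, which is the sense in
which roughly half of the remaining differences are due to random error.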


Ability to Predict Vocational Performance 


The group mental test is clearly important to vocational guidance since it
predicts success on a great many jobs. In some jobs, however, special abilities
are much more important than general ability, and in some routine occupations
workers with very high intelligence are less suitable than persons of
mediocre ability.

The mental ability found in various occupations is shown in Figure 38,
based on test scores of Army draftees. Though persons in higher-level
occupations have greater average ability, there is an extreme range within
each occupation.

Just what will happen to a boy with a given IQ is difficult to predict. He
may do well in school and college and enter a profession, or he may drop out
of school and remain in an unskilled job. Figure 39 is based on unpublished
data from a follow-up study of students who graduated from high school in
Flint, Michigan, in 1943. Ten years later the investigator located and
obtained information about the subsequent careers of 97 boys. The figure shows
what happened to boys at each ability level.

The boys are divided into three levels according to Kuhlmann-Anderson
IQ in Grade 9. For each group, the figure shows high-school grades, college
history, and occupation ten years after graduation. The chart merits detailed
study; it indicates the uncertain predictive value of high-school test scores,
while at the same time showing that they have a definite relation to future
success. We shall mention only a few of the relations that can be traced in
the figure. There is appreciable correspondence between IQ and grades;
practically no one in the lowest IQ level earned superior marks. Boys in the
lowest group did not enter college unless their grades were exceptional, and
they were more likely than the other groups to be in unskilled jobs. About
one-third of the group with IQ 90-104 entered college, and half of them
graduated. The group with superior IQs earned better high-school marks,
but they did no better in college than students with average ninth-grade IQs
and similar high-school grades. Moreover, the occupational status of high
and middle IQ groups who went to college is the same. Among those who
did not go to college the occupational level corresponds somewhat to IQ.
The most striking finding is that, regardless of IQ or high-school average,
every student who finished college was in an upper-level occupation ten years
after completing high school. The predictive significance of a ninth-grade IQ
would differ somewhat in other times and other places; it would be desirable
for any high-school counselor to perform his own follow-up study in order to
establish expectancies for his school.

The fact that boys with IQs below 100 can succeed in college is hard to
explain in any general way, but the individual cases often are quite
understandable. Alex, though he had an IQ of 93 in Grade 9, eventually became a
lawyer. The IQ was not inaccurate: he had 93 on a retest some months later,
and 113 on the Stanford-Binet. Alex had lived in a boarding home during his
early school years following the death of his mother, and suffered from a
sense of inadequacy which led him into aggressive, offensive behavior. A
counselor felt that Alex had ability even though his tests and grades were
poor. Under the counselor’s friendly encouragement he improved his marks
to the B level and transferred to a college-preparatory curriculum. His
personal adjustment also improved. After war service Alex entered college and
completed his law course successfully (Cantoni, 1954).

The rather large number of “late bloomers” like Alex warns against making
a definite and final separation between students with high and low ability at
the start of high school. It is hard to teach complex ideas to dull pupils, and
their presence in the mathematics or French class will impede the teaching
of the ablest. Many potentially able students, though, will not be recognized
in the ninth grade. Any grouping plan must make provision for the student
whose ability is discovered midway through high school. He must be able to
fulfill college requirements without too much loss of time; otherwise, much
of his talent will be wasted.

The facts so far presented bear on differences between occupations but
not on those within occupations. Many validation studies are available for
specific vocations. One especially interesting result comes from a follow-up
of workers in the home office of an insurance company. Nearly 700 workers
hired between 1937 and 1949 were tested on a short general mental test at
that time. New workers enter in the lower job categories and are promoted
as their performance shows merit. The correlation between responsibility
held in 1954 and score at time of hiring was .60. Fifty-four percent of those
in “decision-making jobs” had had scores of 120 and over; only 5 percent with
scores 0-99, and 19 percent in the 100-119 range, held those high-ranking jobs
(Knauft, 1955).

Ghiselli (1955) reviewed the entire literature on prediction of success of
workers and found that the group mental test predicts both training and
performance criteria for many jobs. The coefficients for any job title range
from very high to negligible, depending upon the range of ability in the group
tested and the demands of the specific job. Average validities for group
mental tests against job proficiency fall in the following ranges:

.00 to .19   Sales, service occupations, machining workers, packers and
             wrappers, repairmen

.20 to .34   Supervisors, clerks, assemblers

.35 to .47   Electrical workers, managerial and professional

Somewhat similar results are reported by the USES. Correlations for general
mental ability are above .40 for success of automobile mechanics, key-punch
operators, practical nurses, and bindery workers, for example. In contrast,
correlations are below .15 for electronics parts assemblers, welders, pottery
decorators, and meat-packing workers (Guide to the Use of GATB, 1958).

6. Characterize the occupations for which general ability is a good predictor.

NOTEWORTHY GROUP TESTS 

The tests listed below are a representative sample including some of the
good current tests, tests primarily of historical importance, and tests
illustrating novel measurement techniques. The descriptions are designed to
indicate some of the ways the tests differ rather than to provide a full review
of important qualities, and the reader should obtain fuller information from
the test manual and from reviews of any test that interests him.

• American Council Psychological Examination (ACE); L. L. and T. G.
Thurstone; Educational Testing Service, 1924, frequent revisions. For college
entrants; high-school form also available. This test was formerly the principal
instrument used in testing college freshmen and in research on college
success. The total score predicts grade averages, usually with validity about
.45. There are two part scores: L (linguistic), based on vocabulary, verbal
analogies, etc., and Q (quantitative), based on number series, figure
analogies, etc. No consistent evidence was found that the L and Q scores
predicted success in verbal and scientific subjects, respectively, as intended
(Berdie et al., 1951). Because the part scores were of little value and
the many subtests awkward to administer, the test is now being supplanted.

• Army Alpha Examination; various authors, revisions, and publishers;
currently distributed by Western Psychological Services, Psychological
Corporation; 1916, 1939, et seq. For secondary school and adult use. Originally
designed for Army group testing, the test has several speeded subtests
calling for information, reasoning, and practical judgment. Has no advantage
over more modern tests.

• California Test of Mental Maturity (CTMM); E. T. Sullivan, W. W.
Clark, E. W. Tiegs; California Test Bureau, 1936, 1957. Levels from
kindergarten to adult. One of the most widely accepted current tests, with
unusual variety of items, good format and standardization, and a continuous
series of levels. The full test requires over one and one-half hours at school
ages. There is a Short Form for use where less reliable measurement is
acceptable. Separate “Language” and “Non-Language” IQs are offered, but there
is little evidence to indicate the practical significance of differences
between the two IQs. Subscores for memory, logical reasoning, etc., attempt to
provide a profile of abilities, but these subscores have dubious validity and
should be given little attention. By standardizing CTMM along with the
California Achievement Tests, the authors provide for comparison of the pupil’s
attainment scores with the expectancy for his IQ level (see p. 387).

• College Qualification Tests; George K. Bennett and others; Psychological
Corporation, 1957. College and precollege. An eighty-minute test
designed for measuring general scholastic promise of college applicants and
students. In addition to the total, the part scores measure verbal ability,
numerical reasoning, and information in three fields. This, like SCAT, is
essentially a sample of educational attainments. Norms are provided for various
types of college and also for upper years of high school. The total score
predicts freshman grade average, validity often being above .60.

• Concept Mastery Test; Lewis M. Terman; Psychological Corporation,
1939, 1956. College juniors and above. An untimed test designed to measure
the highest ranges of vocabulary and verbal reasoning.

• Cooperative School and College Ability Tests (SCAT); anon.;² ETS,
1955. Grade 4 to college. This test is offered to replace the ACE as a device
primarily for predicting academic success. A Verbal score measures vocabulary
and reading comprehension; a Quantitative score measures arithmetic
reasoning and understanding of arithmetic operations. Both measure
school-learned abilities. The tests are well prepared, but validity, stability,
and discriminating power in exceptional groups have not yet been thoroughly
investigated. They should serve well for selecting potential college-goers, and
appear highly suitable for use by educators not psychologically trained.

• Culture-Free Intelligence Tests; R. B. Cattell; IPAT, 1933, 1944, 1950.
Age 4 to adult, 3 levels. A nonverbal test including matrices and other
reasoning tasks with geometric figures. The test is independent of language
skill but is not truly free of cultural influences. Norms for the test are
unsatisfactory; IQs have a very large s.d.

• Davis-Eells Games; Allison Davis and Kenneth Eells; World Book,
1953. Grades 1-2, 3-6. Items are designed to be interesting and fair to
lower-class children (see below). Problems are presented pictorially rather
than verbally, and the test is relatively difficult to administer. Though the
test is long, the reliability is lower than that for competing tests. The test
does not predict academic performance under present teaching methods as well
as verbal tests do, but is designed to locate children for whom new teaching
approaches are needed.

• Henmon-Nelson Tests of Mental Ability; Tom A. Lamke and M. J. Nelson;
Houghton Mifflin, 1931, 1950, 1957. Grades 3-6, 6-9, 9-12, college. A
thirty-minute test of the “spiral omnibus” pattern in which various item types
are presented in rotated order with a steady rise in difficulty. Items include
information, proverb interpretation, figure analogies, following directions,
etc. Carbon-sheet method of quick scoring. The 1957 revision is well
designed as a short measure of scholastic ability having reliability over .90,
but considerable overlap with reading ability and no diagnostic features.

• Kuhlmann-Anderson Intelligence Tests; F. Kuhlmann and Rose G.
Anderson; Personnel Press, 1927, 1952. Age 6 to maturity. (See pp. 218 ff.)

• Lorge-Thorndike Intelligence Tests; Irving Lorge and Robert L. Thorndike;
Houghton Mifflin, 1951. Levels from kindergarten through high school.
A well-constructed test. At the primary level, questions requiring verbal
understanding and reasoning are read by the teacher, and the pupil responds
by marking pictures. In Grades 4 and above, nonverbal and verbal sections
can be separately administered. The nonverbal items call mostly for general
ability, independent of vocabulary and reading. Since the verbal and
nonverbal scores correlate about .70, differences between the scores will not
be significant for the majority of pupils.

² An entry of anon. indicates that the test was prepared by the staff of some
organization, in this case the Educational Testing Service. The responsibility
for test design is shared so widely that listing the many cooperating authors
would not be informative.

• Miller Analogies Test; W. S. Miller; Psychological Corporation, 1926,
1950. Superior adults. A test of 100 very difficult verbal analogies items,
administered only at licensed centers. Severe restrictions protect the security
of items, since it is used by many graduate schools to test applicants. Sizable
validity coefficients for predicting success in graduate study are reported,
despite the narrow range of ability within which the test is designed to
discriminate.

• Ohio State University Psychological Examination (OSPE); H. A. Toops;
Ohio College Association, 1919, frequent revisions. SRA publishes Form 21
(1940); the Minnesota Scholastic Aptitude Test is a shortened version of
Form 23. High school, college. By restricting items to vocabulary and reading
ability, the author obtains a score which predicts college marks with
unusual accuracy (coefficients of .60 are common). The test requires about two
hours.

• Otis Quick-Scoring Mental Ability Tests; A. S. Otis; World Book, 1920,
1936, 1954. Forms for Grades 1 to college. Otis was one of the first to
experiment with group measurement techniques. His tests generally combine
verbal and nonverbal reasoning items to obtain a quick measure of general
ability. IQs tend to be lower than for other tests. The technical development
and the manuals for these tests are less adequate than for tests of recent
origin, but predictive validities against school achievement compare favorably
with other tests.

• Pintner General Ability Tests; Rudolf Pintner; World Book, 1931, 1945.
Grades 4-9. There are separate language and nonlanguage tests, each requiring
about 45 minutes. There are companion tests for lower grades, namely,
the Pintner-Cunningham Primary Test and the Pintner-Durost Elementary
Test. The latter contains two subscores, one being based on verbal reasoning
items read by the pupil. The other measures vocabulary and verbal reasoning
independent of reading skill by having the pupil mark pictures in response
to questions read by the teacher. The two scores give significantly
different information. The verbal test for intermediate grades is much like
other older tests. The nonlanguage test contains six subtests, some of which
are ingenious and taxing reasoning tests. The nonlanguage score adds more
unique information to data normally available from achievement tests than
does the verbal test or the usual omnibus intelligence test.

• Progressive Matrices; J. C. Raven; H. K. Lewis (London), Psychological
Corporation, 1938, 1947, 1951. Ages 5½ to 11, age 9 upward. One of the
best available techniques for obtaining a nonverbal measure of reasoning
ability, though the single type of item places a possibly undesirable emphasis
on spatial reasoning. The norms are based on poorly selected groups.
Reliability of the scale in single age groups, especially young ones, is
inadequate. An efficient, properly standardized form is badly needed.

• Scholastic Aptitude Test (SAT); anon.; College Entrance Examination
Board, 1926 to date. This examination is administered in a controlled
program to applicants for admission to affiliated colleges. Not sold for general
use. Tests measure vocabulary, verbal reasoning, knowledge of high-school
mathematics, and quantitative reasoning, combining comprehension items
with subtle reasoning in order to discriminate at high levels of ability. Within
the group surviving to the fourth year of college, SAT-Verbal correlates .43
with grade average in the typical college, and SAT-M (Quantitative)
correlates .27 (J. French, 1957).

FIG. 40. Nonverbal item for the Semantic Test of Intelligence. (Rulon, 1952.
Reproduced by permission of Dr. Phillip J. Rulon.)

• Semantic Test of Intelligence; P. J. Rulon; unpublished, 1952. Adult.
A nonreading test for testing conceptual reasoning, designed to determine
which illiterates in the Army should have literacy training. (This is part of
an effort to utilize draftees who would otherwise have to be rejected.) The
man is taught by pantomime the meaning of certain symbols, and works
through a series of “decoding” problems, beginning with two-choice items.
In Figure 40, the upper panel shows that the first symbol stands for “cow”;
the first phrase is “cow jumping,” and the fourth picture should be circled.
Correlations show that this worksample of learning is more a measure of
ability to learn, and less a measure of attainment, than other mental tests
(Rulon and Schweiker, 1953).

• Terman-McNemar Test of Mental Ability; L. M. Terman and Quinn
McNemar; World Book, 1940, 1949. Grades 7-12. A well-constructed
test, reliable and easily interpreted. Restricted to verbal reasoning and
information in order to predict school marks.
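
The remark in the Lorge-Thorndike entry, that scores correlating about .70
yield differences that are seldom significant, can be illustrated with the
standard formula for the reliability of a difference between two standardized
scores of equal variance. The Python sketch below is an added illustration;
the function name is hypothetical, and the reliability of .90 for each part
score is an assumption chosen only for the example, not a figure from the text
or from any test manual.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y for equal-variance scores.

    r_xx, r_yy: reliabilities of the two scores
    r_xy:       correlation between the two scores
    """
    if not -1.0 < r_xy < 1.0:
        raise ValueError("r_xy must lie strictly between -1 and 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Assumed reliability for both part scores (illustrative only).
ASSUMED_RELIABILITY = 0.90

# Intercorrelation of verbal and nonverbal scores quoted in the text.
r_diff = difference_reliability(ASSUMED_RELIABILITY, ASSUMED_RELIABILITY, 0.70)
print(f"reliability of the difference score: {r_diff:.2f}")
```

Under these assumptions the difference score is noticeably less reliable than
either part score, which is why small verbal-nonverbal differences deserve
little weight.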

7. In an adult counseling center, adults with varying educational backgrounds 
and vocational goals must be given advisement. Prepare what you consider a 
minimum list of intelligence tests (group and individual) needed to cope with 
all nonpsychiatric cases. 

8. Prepare a minimum list of intelligence tests needed by a school psychologist 
who is expected to diagnose any pupil, age 6 to 16, whose behavior or school 
work is considered unsatisfactory. 

9. Would a group or an individual test be preferable

a. in screening applicants for teaching positions in a large city?
b. in testing juvenile delinquents prior to decisions about probation?
c. in research on trends in the intelligence of immigrants?
d. in selecting secretarial employees for a university?

10. What interpretation would be made if a child has Non-Language IQ 120,
Language IQ 90? What interpretation would be made if the Language IQ 
were 120, Non-Language 90? 

11. In the CTMM (Elementary) test, pupils listen to a story about “The Pack Train.”
In the story, a man goes to a mining camp by pack train, passing a glacier and
being threatened by a grizzly bear. After hearing the story, pupils go on to
take other sections of the test. After an elapsed time of 25 minutes, the pupils
are asked questions about the story. What does the test measure besides
general ability in “delayed recall”?


USE AND FUTURE PROSPECTS OF ABILITY TESTS 

After a generation of enthusiastic acceptance, group tests of ability have
come under attack from many quarters. One challenge comes from the analytic
measures of differentiated abilities, which hope to offer accurate
descriptions of patterns of ability in place of the overall index of the omnibus
test. The other principal challenge grows out of the recognition that the
tests, at least from age 8 to 20, are strongly influenced by past school
achievement.


Chief Functions of Group Mental Tests 

If one is comparing students who have been in the same class, the high
correlation between general ability tests and achievement batteries means that
it makes little difference which we use, since they lead to similar decisions.
When one compares persons coming from different educational backgrounds,
the general ability test is often much the more suitable because it
is not matched to any particular educational experience. Among the important
functions of the general mental test are these:

• Comparing pupils at the beginning of a school year. The mental test
is fair to pupils coming from various schools, whereas an achievement
battery might not be.

• Decisions regarding the admission of college students. High-school
grades or rank in high-school class usually predict better than mental tests,
but it is hard to compare grades from different schools, especially small ones.
A combination of high-school grades with a group mental test commonly
predicts college marks with a validity of .60 to .70 (D. Harris, 1940).
Achievement batteries can be substituted for the mental test as a predictor.
Several studies of the Iowa Tests of Educational Development find validities
of .60-.70 for the test alone (Using the Iowa Tests, 1957), but this test
requires several hours whereas the usual college-level mental test takes from
45 minutes to two hours.

• Selecting employees. The fact that persons have different school
backgrounds makes an achievement battery unsuitable, especially where the job
depends little on school learning. General mental tests are often more
acceptable to adult job applicants than a test reminiscent of schoolwork
would be.

• In research, for dividing subjects into groups of equal ability so as to
compare different methods of instruction or to study effects of motivation,
etc.

The Spectrum of Ability Tests 

In the descriptions of tests labeled as measures of general mental ability,
scholastic aptitude, or intelligence, it is apparent that that name covers a
considerable variety of test content. Tests can be arranged in approximately
the pattern shown in Figure 41, along a spectrum ranging from those which
are strictly measures of outcomes of education to those which are most
independent of specific instruction. For the sake of contrast, we anchor the
scale at the educational end (A) with tests of subject-matter proficiency which
measure how much the pupil knows about particular courses such as algebra
and physics. A few tests (notably CQT) have measured information in
subject fields as a part of “scholastic aptitude” tests. Tests of general
educational development are next in order (B). These tests, to be described in
Chapter 13, measure general abilities and study skills which might be acquired
in many different courses, such as ability to interpret graphs and charts,
ability to comprehend and draw conclusions from scientific articles, etc. At C,
the tests measure educational proficiencies such as size of vocabulary and
arithmetic reasoning; these intellectual tools are even more fundamental to
intellectual work than those at B. At D we begin to move away from things
directly taught in school; the tests present puzzling verbal problems which
require the student to reorganize knowledge.

FIG. 41. Spectrum for comparing tests of scholastic aptitude.

The distinction between C and D tasks may be illustrated by vocabulary
items. A C item would ask the meaning of a fairly rare word, for example,
requiring choice of the nearest synonym for the given word, as in this
Lorge-Thorndike item:

subvention:   meeting   support   change   criminal   lessee

A D item uses words known to most pupils in the grade tested but requires
a difficult comparison. For example, the DAT Verbal Reasoning Test
(Bennett et al., 1947) requires choice of two words to complete an analogy:

______ is to static as dynamic is to ______

1. radio     2. politic     3. inert     4. air

A. speaker   B. motor       C. regal     D. active

At E, we come to tasks which require reasoning with abstract concepts but
which require little if any familiarity with the examiner’s language. As we
move toward F, we attempt to emphasize concepts and experiences familiar
to every subject, while still requiring careful reasoning.

Spearman’s g seems to correspond closely to E on this spectrum; it is
concerned with the ability of the individual to form or detect relationships and
abstractions. Most group tests combine tasks at levels D and E, sometimes
in separate verbal and nonverbal scores. Many recent tests, particularly for
college-level students, move up the spectrum to B and C levels, thus becoming
more directly a reflection of how well the student has done in past schooling.
Tests for the primary levels, on the other hand, are necessarily
concentrated at E and F. A few tests have succeeded in developing items of
type F for older subjects.

The functions of the tests at the two ends of the spectrum are different.
Those toward the top are designed for cold-blooded prediction of future
school success. One who has done poorly in past schooling is a bad bet for the
future, no matter what his “intelligence” may be. Those who admit students
to college or award college scholarships rarely take a chance on the student
“who would succeed if he turned over a new leaf.” They prefer the test which
deliberately handicaps the student who has had poor schooling or has taken
little advantage of it. On the other hand, the teacher and counselor working
with a student want to know what undeveloped resources he has. They can
rely on past achievement for an estimate of probable future accomplishment
when nothing out-of-the-routine is done for the student, but the mental test
ought also to locate undeveloped potential that novel treatment may bring
out. For the latter purpose the most information is provided by tests of types
E and F, which have a minimum of overlap with achievement. Tests in the
range from D to F are preferable when it is necessary to compare persons
coming from different educational and cultural backgrounds. The more
different the backgrounds, the farther toward F the test should be, unless the
criterion task requires some particular background.

12. Classify the Kuhlmann-Anderson subtests illustrated in Figure 36 on the
spectrum.

The Shift Toward Achievement Tests for Academic Prediction 

It is evident that tests of types D and E have much overlap with tests at
B and C, but the effort of test developers, beginning with Binet, was
directed to measuring “mental ability” as distinct from achievement. Recently,
many test developers have come to the conclusion that this is not a profitable
endeavor when one’s aim is to predict school success. These workers are now
recommending tests of types B and C for this purpose. The ACE test, for
example, was for a long time the principal instrument used by colleges for
measuring scholastic aptitude of entering freshmen. In 1955 its publishers,
acting on the advice of specialists in educational guidance, introduced in its
place the School and College Ability Tests (SCAT).

The SCAT is a measure of verbal and arithmetic comprehension; these
abilities played a large part also in the ACE, but that test included number
series, figure analogies, and other reasoning tasks not directly taught in
school. According to the manual (Cooperative Test Division, 1955, pp. 3-5):

The tests in the SCAT series have been designed and developed for
the principal purpose of helping teachers and counselors—and students 
themselves—to estimate the capacity of each individual student [in 
high school and college] to undertake the academic work of the next 
higher level of schooling. . . .

In considering the general purposes for which the SCAT series was
to be designed, and the continuity of measurement that was to be a
principal objective in development of the series, the Advisory Committee
recommended strongly that the new tests should measure “school-learned
abilities” directly, rather than psychological characteristics or
traits which afford indirect measurement of capacity for school learning.
This recommendation was based on three general observations shared
by all members of the committee: (a) that the best single predictor of
how well a student is likely to succeed in his school work next year is
“how well he is succeeding this year”; (b) that a certain few school-learned
abilities appear to be critical prerequisites to next steps in learning
throughout the range of general education, among them skills in
reading and in handling quantitative information; and (c) that school-learned
abilities usually can be discussed with students and parents in
a more objective way than can such emotionally-loaded characteristics
as “intelligence” or “mental ability.”

Demand for Ability Tests Independent of Social 
and Educational Background 

While one group of testers proposes to abandon general ability tests as
unnecessary, many others have taken the position that the solution to the
problem of overlap is to make general ability tests less dependent on background
even if this reduces their correlation with subsequent school success.

There has been particular dissatisfaction with the tests for high levels of
ability, most of which are measures of information at least as much as they
are measures of thinking power. It is not a simple matter to invent unambiguous
difficult items, and most testers raise difficulty by increasing the degree
of speeding or by introducing items which depend on rare knowledge. The
Concept Mastery Test and the Miller Analogies Test, for example, require
the subject to have a very large vocabulary. The matrix test can be made
very difficult, but no high-level version of it has been published.

A related defect is the failure of present tests, whether group or individual,
to measure creative ability (Thurstone, 1951). Efforts to test types of
thinking characterized by logic, accuracy, and knowledge have been very
successful. There has been little success in identifying the types of thinking
which distinguish the novelist from the engineer or the theorist from the
museum curator. This is probably of minor importance at younger ages where
all abilities seem to be highly correlated, although even here we perhaps
occasionally overlook a child who has high potential along lines not stressed in
our tests. In vocational guidance, we have little basis for judging which man
will be most insightful, most creative, and most original.

A much more severe criticism is made by Allison Davis, who argues that
use of tests dependent upon past schooling and school-related behavior
denies many children a fair opportunity. Children who do well on mental
tests are encouraged by teachers, and if they have trouble with school work
a special study of their difficulties is made. If a child with a poor mental-test
record, on the other hand, has trouble with schoolwork, the teacher is likely
to accept this as natural and make no deeper inquiry. The child who could
do better schoolwork than he has in the past is neglected just because the
poor background lowers his mental-test score.

Davis (1951) and his associates believe that American society contains
several cultural segments, of which the largest and most distinctive are the
middle class and the lower or “working” class. The former group, consisting
of professional, skilled, and white-collar workers, values education as a
means of maintaining a desirable place in society. On mental tests, the average
middle-class child does better than the average lower-class child. Davis
thinks that this social-class difference results from the way the tests are
constructed rather than from deficiencies in reasoning ability among the
lower-class children. Tests, he says, are biased against the lower-class child
(1951, p. 15):

The type of problem in present tests, which is clearly biased, may be
illustrated by the following.

A symphony is to a composer as a book is to what?

( ) paper  ( ) sculptor  ( ) author  ( ) musician  ( ) man

On this problem 81 percent of the higher socio-economic groups
marked the correct response, but only 51 percent of the lower socio-economic
group did so. In an experiment designed by Professor Ernest
Haggard we made a problem similar to that just read, but we used
words and situations common to all social groups of children. This problem
was read to the pupils.

A baker goes with bread, like a carpenter goes with what?

( ) a saw  ( ) a house  ( ) a spoon  ( ) a nail  ( ) a man

On this culturally fair problem, 50 percent of each socioeconomic
group gave the correct answer.

This criticism of test content implies that some types of reasoning tests
handicap the lower-class child more than others. There are two major studies
which bear on this contention, both of them carried out by associates of
Davis. Havighurst and Janke applied individual tests to all the 10-year-olds
and nearly all the 16-year-olds in a midwestern town. The average IQs are
shown in Table 26. In this table, class C consists of families of white-collar
workers and small businessmen; D, of semiskilled workers and laborers; and
E, of the lowest occupational groups and the down-and-outers. All the tests
except the Wechsler-Bellevue show about the same difference between
classes. The performance tests do not give a substantially more favorable
picture of the lower-class child than does the Stanford-Binet. Whatever cultural
handicaps or hereditary handicaps there may be seem to be present in
both verbal and performance measures. The difference between Binet and
Wechsler results for 16-year-olds is not large enough to require explanation.


TABLE 26. Average IQ of Middle- and Lower-Class Groups

                        10-Year-Olds                          16-Year-Olds

                           Cornell-Coxe          Porteus
Social          Stanford-  Performance   Draw-   Maze          Stanford-  Wechsler-
Status    N     Binet      Battery       a-Man   (MA)    N     Binet      Bellevue

C         26    114        116           107     128     44    112        109
D         68    110        110           102     128     49    104        102
E         16     91         96            91     104     13     98        103

Source: Havighurst and Janke, 1944, 1945.

The second study (Eells et al., 1951) involved the administration of numerous
group tests to a very large number of pupils. As expected, the scores
correlated with social status (.33 for the Kuhlmann-Anderson). This difference
was found on all types of items. Ninety-one percent of the items for
16-year-olds, and 63 percent for 10-year-olds, were easier for the middle-class
child. Although verbal items showed a slightly greater difference, this study
also implies that the handicap of the lower class is not primarily a function
of test content.

Eells made another observation, however, which points to differences in
test motivation as a possible source of inaccurate comparison. High-status
pupils tended consistently to select the most plausible incorrect choices
(“distractors”) whereas the low-status pupils scattered responses widely over
all wrong choices. This seems to indicate that the lower group guesses more,
and puts forth less effort on hard items. This problem of differential motivation
is one already mentioned in connection with study of national and racial
differences. Tests are constructed to predict readiness for academic schooling
and therefore emphasize abstract ideas, careful self-criticism, and willingness
to work at a task which offers no visible reward. Davis’ case studies
indicate that working-class children live in a world concerned with concrete
problems where sound thinking and errors meet with tangible rewards and
penalties. Havighurst comments as follows on the motivational differences
between classes (Eells et al., 1951, p. 21):

The characteristic middle-class attitude toward education is taught
by middle-class parents to their children. School is important for future
success. One must do one’s very best in school. Report cards are studied
by the parents carefully, and the parents give rewards for good grades,
warnings and penalties for poor grades. Lower-class parents, on the
other hand, seldom push the children hard in school and do not show
by example or by precept that they believe education is highly important.
In fact, they usually show the opposite attitude. With the exception
of a minority who urgently desire mobility for their children, lower-
class parents tend to place little value on high achievement in school or
on school attendance beyond minimum age.

When the middle-class child comes to a test, he has been taught to
do his very best on it. Life stretches ahead of him as a long series of
tests, and he must always work himself to the very limit on them. To
the average lower-class child, on the other hand, a test is just another
place to be punished, to have one’s weaknesses shown up, to be reminded
that one is at the tail end of the procession. Hence this child
soon learns to accept the inevitable and to get it over with as quickly as
possible. Observation of the performance of lower-class children on
speed tests leads one to suspect that such children often work very
rapidly through a test, making responses more or less at random. Apparently
they are convinced in advance that they cannot do well on
the test, and they find that by getting through the test rapidly they can
shorten the period of discomfort which it produces.

In an effort to provide a test which will not seem unduly abstract and
schoolish to lower-class children, and which will not penalize them for past
indifference to schooling, Davis and Eells developed a new group test. The
“Davis-Eells Games” require reasoning, but they deal with everyday situations
rather than abstractions (Figure 42). The items appear to be interesting
to pupils, appealing to much the same motivation as a comic strip satisfies.
Just how well the test achieves its aim is difficult to judge from present
evidence. It is much less dependent on reading than the Kuhlmann-Anderson,
which is itself less verbal than many group tests (Love and Beach,
1957). One study finds that lower-class children lag just as far behind the
middle-class group on the Davis-Eells as on a conventional test, another
finds a smaller correlation with social class for the Davis-Eells than for the
conventional test, and a third finds just the opposite, the lower-class group
having a relatively greater handicap on the Davis-Eells (Coleman and
Ward, 1955; Noll, 1958; Fowler, 1957). These discrepancies may be due to
community differences, but they make it clear that we have much to learn
about the implications of social class for test interpretation.



“This picture shows a woman. It shows a man with a bump on his head, and it
shows a broken window. A boy is outside the window. Look at the picture and
find out the thing that is true.

No. 1. The man fell down and hit his head.
No. 2. The ball came through the window and hit the man’s head.
No. 3. The picture does not show how the man got the bump on his head.
       Nobody can tell because the picture doesn’t show how the man got the
       bump.”

(No. 2 is scored as right.)


“Each boy is trying to take three packages home. Which boy is starting to load
the packages the best way so he can take all three home?”

(No. 3 is scored as right.)


FIG. 42. Specimen items from the Davis-Eells Games. Questions below the figure
are read aloud to the group by the tester. (Copyright 1953, World Book
Company. Reproduced by permission.)

Placed alongside the arguments which led to the construction of SCAT,
Davis’ arguments bring out an issue hidden beneath previous testing practice.
Ever since Binet, mental testers have tried to ride in two directions at
once. They try to predict school success and therefore include measures of
educational skill in their tests. But they also ask the same tests to measure
a psychological attribute which is thought of as distinct from educational
attainment. Most present tests are a muddled combination of predictive
measures which rest upon past achievement, of measures unrelated to either
past or future achievement, and of measures which predict future performance
but do not depend on past schooling. Despite their ambiguity
these tests serve many purposes fairly well, particularly when other dependable
information is lacking, as is the case in employee selection and
recruit classification. Other information is available in most school counseling,
however; the potential user then ought to ask what the mental test offers
that adds to the other data. This question forces the test developer to take a
clear position. He can develop a superior measure of educated skills, or
he can develop a superior measure of unschooled abilities. Either of these
can add to the understanding of the pupil in a way that no poorly defined
composite can.

The argument for basing educational decisions (selection, placement,
guidance) on achievement tests expresses a conservative outlook (Cronbach,
1957). One who uses such tests takes as his task to predict who will do
well in school and society as now constituted. In his eyes, the tests are unfair
only if lower-class children do better in school than their test scores
forecast. Investigations show just the contrary: when test scores are matched
the middle-class children do somewhat better in school (Turnbull, 1951); if
anything, the tests do not give the middle-class group enough advantage.
Stroud (1942) concluded that “for purposes of prediction of success in
schools as now organized, intelligence tests appraise the ability of unfavored
groups as fairly as they appraise the ability of the average or the
favored groups and although the low average intelligence-quotient of
the unfavored groups may be the fault of society or of biology, it is not due
to unfairness inherent in the intelligence tests” (italics ours). The home,
the school, and the world of business all demand working for remote goals
by means of abstract ideas. The usual group mental tests show how well a
pupil is likely to fit into that system.

Davis’ position attacks this conservative philosophy. He believes that
society should be fitted to the individual. If our schooling calls for thinking
and motivational patterns that fit only the child of middle-class parents, we
may be neglecting our responsibility to discover teaching methods that will
bring out the ability of the lower-class child. Fundamental as this argument
is, it has limited practical significance at this moment, because neither Davis
nor anyone else has suggested what educational methods should be used
with the lower class. When and if new methods for this purpose are found,
the tests that predict success may be different from those now used by
schools.

Much theoretical knowledge is required on the relation of ability patterns
to choice of teaching method. Ideally one would like to put each person
into that type of instruction where he will do best. There is a great need for
studies of the correlation of tests with success under various methods of
instruction. A test which correlates higher under one procedure than another
is needed if we wish to allocate the pupil to the method best for him.
There has been no systematic research on validity taking method of instruction
into account.

13. In 1958, a committee of testing specialists made the following recommendations
(among others) to state school officials regarding desirable testing programs
(Identification and Guidance of Able Students, 1958). Classify the recommended
tests on the spectrum and indicate what published tests seem to
meet these specifications.

a. For selecting college scholarship winners, if there is a state-wide competition,
the final examination should measure use and comprehension of the
English language, quantitative reasoning, and ability to handle problems
of comprehension in basic fields of knowledge including the sciences.
b. In Grade 6 or 7 there should be a scholastic aptitude test as little dependent
on academic skills as possible—i.e., a reasoning test based on material not 
directly taught in school. In addition the test should probably yield a score
based on verbal and quantitative material.
c. In Grade 10 or 11 there should be a test oriented primarily toward predicting
college success. Its content should probably be in the region where
aptitude and achievement merge.

14. What reasons can you give for or against the above recommendations insofar 
as they concern type of test and placement? 

15. Among college entrants, boys tend to surpass girls on tests of mathematical
abilities and girls surpass boys in literary interpretation. How would you take
these facts into account in planning a test to be used in a state-wide public
competition for scholarships?

16. Travers says (Donahue et al., 1949, pp. 148-149):

"The most recent emphasis in testing children of elementary school age has
been upon diagnosing the causes of difficulties in specific aspects of learning.
The development of diagnostic tests in various subject matter fields has shifted
attention from the problem of predicting over-all academic success to the problem
of determining the causes of academic difficulties. The shift in emphasis is
a fortunate one since it is doubtful whether over-all predictions of achievement
in elementary school are particularly useful except where extreme deviates are
being considered."

Do you agree with the last statement? Can the classroom teacher use informa¬ 
tion about the pupil's MA, if it is within one year of the group average? 

The Interpretation of “General Mental Ability” 

The preceding section discussed the practical implications of mental tests,
but before leaving the topic we should also consider the place of the concept
of “general mental ability” in psychological science. You will recall that the
original measures of individual differences, tried by Galton and J. McK. Cattell,
isolated narrowly defined abilities such as speed of judgment. This work
produced completely miscellaneous instruments and results and seemed to
have no relevance to the general problem of human intelligence. Binet
started a revolution with his hodgepodge, complex instrument, and all the
more recent testers of general ability have followed his banner. Did he really
hit on the essence of mental ability? Or can we hope for some radically different
approach which will penetrate more deeply into the problem?

The efforts of many psychologists have been directed to attempts to understand
general mental ability. From one side, they study the tests and their
correlations and try to infer what the common ability running through the
test depends upon. From the other side, they examine thinking processes
and try to explain what differentiates the more successful from the less
successful thinker.

The large amount of research based on tests has established a picture of
the “general ability” of the tests. We see the tests as, first of all, a sample of
performance in solving a standardized intellectual problem. While it is not
a true sample of everyday life, it is nearly as complex as any practical task.
Far more than the person’s “intellect” is involved. His effort and his success
depend on his self-concept, his feeling about the authority who gives the
test, his ability to tolerate stress and frustration, and many other qualities.
The test, then, gives a picture of the adjustment of the total person to a
standardized situation making intellectual demands.

The adjustment which the test calls for seems to involve the ability to
interpret a complicated stimulus situation, to test various possibilities mentally,
and to carry out a response which in some way “completes” the situation.
It is evident that such interpretations are dependent on past experience.
Even in a strange problem like the matrix the person must select essential
elements and bring to bear abstract concepts previously learned. At the
same time, level of development no doubt depends on innate potential. The
mental-test score reflects present proficiency, i.e., the structure of habits
and behavior processes which experience has molded out of the raw material
heredity provided.

The interpretations made of tests in practical work stem largely from a
clinical orientation. It is not surprising that clinical workers should regard
the test as a measure of the functioning of the total person rather than of
intellect alone, and consider this an advantage rather than a disadvantage.
General psychologists, on the contrary, have been asking just how man’s
mind is able to interpret his world, and they have therefore tried to isolate
intellectual processes for separate examination. It is from such “pure” research
on thinking processes that we today hear the strongest suggestions
for new approaches to intellectual measurement.

The work of Piaget is representative of this trend. His work may be regarded
as a direct continuation of the line of research Binet was engaged in
before he turned to making mental tests for the Paris schools. Binet had been
trying to understand how attention, memory, and other processes operate,
and out of these experiments he drew his practical test procedures. Piaget
has devoted his lifetime to the study of developmental changes. How, he
asks, do perception and reasoning differ in the older and younger child? Do
older children show different processes of thought, or merely superior speed
and complexity? (Piaget, 1947)

Only a brief summary of his conclusions can be offered. He reports that
the changes are qualitative, that the older child thinks in quite a different
way from the younger. The child must first learn to make perceptual comparisons
and to abstract from his sense impressions certain constructs or
“schemata.” His first schemata are merely the identifications of objects: for
instance, the recognition of his mother as the same person no matter how
her dress, posture, and other superficial appearances change. He gradually
builds one schema upon another, thereby acquiring a repertoire of tools of
thought. Once he realizes that “an object” exists, he can think of it as continuing
to exist even when hidden; this stage is necessary before he can be
expected to find a hidden object. He later develops ideas of shape (constant
even though the retinal image changes), size, identity, order, etc. For example,
the preschool child may be able to compare the size of two blocks,
selecting the larger. There is a certain age where he can judge each pair correctly,
and yet cannot arrange a whole series in order. He focuses on one pair
at a time, and cannot think of the overall order. A schema or idea such as
“order” may first appear in a concrete form, i.e., the child can compare two
bead chains only when they are laid out side by side. Then he learns to
hold the abstract order in mind so that he can compare, for example, a
straight chain with one twisted in a “figure eight.” When the idea of order
is completely abstracted, he can solve logical problems such as “Town A is
north of B, and C is south of B; what can you say about A and C?”

This type of research (see also Harlow, 1949) is beginning to isolate a
strictly intellectual aspect of the person’s reactions to the world. Solving
any problem, it is argued, calls for the possession of certain schemata. The
schematic interpretation replaces the immediate Gestalt impression. To reproduce
a Block Design pattern efficiently, for instance, the child must disregard
the overall pattern and divide the figure mentally into equal squares.

The person does not perceive his world as a physical event. Rather, he
creates a picture in his mind, building up that picture by using whatever
schemata he knows and considers important. This abstract picture, being
simpler than the world, lends itself to formal, accurate reasoning. Piaget
and his associates, as well as workers in other centers, are now translating
his experimental procedures into tests for individual measurement. The bead
tests mentioned above are an example. Such tests will perhaps permit an
inventory of the individual’s equipment for thinking. It is too early to say
whether these tests will have direct practical importance; it is a good bet
that the total score on such a test will correlate highly with Binet’s hodgepodge
scale. Tests based on modern cognitive theory seem certain, however,
to advance our understanding of thought processes and of the experiences
which improve intelligence.

17. Investigators of the aged argue (J. Anderson, 1956, pp. 162, 170-172) that
mental tests for the older adult should call for maturity of judgment, so that
they would be similar to the intellectual requirements of the person’s daily life.
What sort of test items would meet this demand? Could such tests be applied
to adolescents?

18. One of the subtests of the Tanaka group intelligence test requires the
subject to cross slanting lines, making X’s as rapidly as he can, thus:
XXXXX//////. Such a subtest is rarely used in American intelligence
tests. On what basis could the inclusion of such a test be criticized?
What argument or evidence would justify including this subtest in a general
mental test?

19. Tuddenham (1948) gave Army Alpha to a representative sample of soldiers
drafted in World War II. Comparing these data with norms for white enlisted
soldiers in World War I, he found that whereas only 17 percent of the
World War I group had raw scores greater than 104, this score was the
median of the World War II sample. How can this difference be explained?


Suggested Readings 

Examiner’s manual, Cooperative School and College Ability Tests. Princeton:
Educational Testing Service, 1957 (or later edition).

This is a good example of a modern manual. Study of all sections will be
profitable. It explains the test maker’s decisions about what to measure, how to
report scores, etc., along with clear details of standardization research.

Hebb, D. O. The growth and decline of intelligence. The organization of
behavior. New York: Wiley, 1949. Pp. 274-303.

Clinical studies after brain surgery and studies of animals are described which
indicate that innate potential can be distinguished from comprehension
developed in a particular culture. Hebb’s theory emphasizes the importance of
appropriate early experience to develop ability.

Heim, Alice W. Validating intelligence tests. The appraisal of intelligence.
London: Methuen, 1954. Pp. 96-112.

Heim describes five types of investigation which may be used to judge the
adequacy of a general mental test and shows the way in which each type of
study, considered alone, might be misleading.

Tyler, Ralph W. Can intelligence tests be used to predict educability? In Kenneth
Eells & others, Intelligence and cultural differences. Chicago: University of
Chicago Press, 1951. Pp. 39-47.

Tyler distinguishes between tests designed to predict success in present
educational treatments and tests which might be designed to select able persons
who will not succeed in present treatments but might do well under other
methods yet to be invented.


Factor Analysis:
The Sorting of Abilities


NEARLY all the tests considered to this point grew out of Binet’s original
discovery that complex problems measure general adaptive ability better
than do simple tests of reaction and discrimination. Most innovations in
ability testing since 1920 have been concerned with narrower abilities required
in particular jobs or school subjects. Separate measures of verbal,
mechanical, numerical aptitudes, etc., were designed, and many of them
have proved valuable in guidance and personnel classification. We might
merely describe these tests and summarize data on their validities, but such
a catalog would be endless. It will be better to look first at the modern
techniques of classifying abilities which guide the development of such
tests.

Factor analysis is a systematic method for examining the meaning of a test
by studying its correlations with other variables. The investigator gives a
large collection of tests to the same persons. The analysis tries to determine
how many distinct abilities are being measured reliably, to detect additional
“trace” abilities which could be measured reliably by modifying the tests,
and to reduce the confusion which results when the same ability is given
different names in different tests. Factor analysis gives information about
the nature and organization of individual characteristics and clarifies what
any given test measures. It is used in studies of interests, attitudes, and personality
as well as in studies of ability. The purpose of this chapter is to clarify
what factor analysts are doing and to show how a factorial study is interpreted.

It is hard to gain even a partial understanding of factor analysis. The
technique is complicated, though the basic idea is as simple as correlation
itself. The results of investigations have often disagreed, perhaps chiefly
because some of the older work used crude techniques and insufficient data.
During those earlier days, substantial controversies developed whose echoes
still confuse current discussion. Fortunately, these issues have largely been
settled, and factorists agree on many basic facts.

Although factor analysis is mathematical, it involves considerable judgment.
The investigator chooses whatever method of organizing his results
makes the best sense to him, and the result is variation among studies. This
is confusing, just as it confuses the beginning student of geography to find
different maps picturing Greenland in different ways. These differences are
of little concern to the nonspecialist; the important thing is that all maps
agree that there is such a large island in the North Atlantic. In our discussion
of factor analysis we shall concentrate on such major features of the
landscape and omit technical details.


THEORY OF FACTORIAL ANALYSIS 
Interpreting Sets of Correlations 

Looking at a collection of scores such as the Wechsler subtests, we first
face the question, Just how many different abilities are present? The word
ability in such a question refers to a group of performances all of which
correlate highly with one another, and which as a group are distinct from
(have low correlations with) performances that do not belong to the group
(Vernon, 1956). Vocabulary tasks perhaps define such a group. They hang
together, but are they distinct from other types of items? To take a specific
example, Wechsler Vocabulary items call for recall of word meanings.
Wechsler Similarities items call for verbal comparison of concepts. Are these
the same ability? Or can we interpret one as measuring word knowledge
and the other as measuring verbal reasoning?

For a group of junior-high-school students,

reliability of Vocabulary = .90

reliability of Similarities = .80

correlation of Vocabulary and Similarities = .52

The two tests evidently overlap. Squaring the correlation of .52 tells us that
27 percent of either test can be regarded as representing a common or overlapping
“factor.” The reliability indicates that 20 percent of the Similarities
variance is due to error. This leaves 53 percent of the Similarities variance;
this nonoverlapping remainder must be due to some distinct ability, not
common to Vocabulary. Likewise, 63 percent of Vocabulary (100 less 10 for
error and 27 for the common factor) is due to an ability not involved in
Similarities. There is a common factor of verbal facility or reasoning, but
each test also involves something independent. Hence the
two tests do involve distinct abilities.
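
The subtraction just described can be checked with a short sketch. It is
written in Python purely for illustration. The function names
variance_breakdown and as_percent are hypothetical, and the sketch adopts the
simplification used in the text, namely that the squared correlation is taken
as the share of variance the two tests have in common.

```python
def variance_breakdown(reliability, r_with_other):
    """Split a test's variance into common, specific, and error shares.

    Assumption (as in the text): the squared correlation with the other
    test is treated as the common, overlapping share of variance.
    """
    common = r_with_other ** 2        # share overlapping with the other test
    error = 1.0 - reliability         # share attributed to error
    specific = reliability - common   # reliable share not held in common
    return {"common": common, "specific": specific, "error": error}


def as_percent(shares):
    """Round each share to a whole percentage for easier reading."""
    return {name: round(100 * value) for name, value in shares.items()}


# Figures quoted in the text for junior-high-school students.
similarities = as_percent(variance_breakdown(reliability=0.80, r_with_other=0.52))
vocabulary = as_percent(variance_breakdown(reliability=0.90, r_with_other=0.52))

print("Similarities:", similarities)
print("Vocabulary:  ", vocabulary)
```

For Similarities the three shares come to 27, 53, and 20 percent. For
Vocabulary the same subtraction gives 27, 63, and 10 percent, the higher
reliability leaving a larger nonoverlapping remainder.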

Factor analysis works along these general lines, starting from correlations.
Correlation indicates whether tests possess a common element. Binet applied
such reasoning when he decided that his tests, all having a substantial
relation to each other, must be influenced by the same common factor, general
intelligence. Wissler, whose tests had very small intercorrelations, concluded
correctly that his tests had very little in common and therefore
represented different abilities or, as we would now call them, factors.


TABLE 27. Intercorrelations of Three Tests for Navy Recruits

             A      B      C
    A               .81    .69
    B                      .69
    C

To simplify tables, each correlation is presented only once. The correlation
of A with B (or B with A) is .81. Symmetrical entries could be made below
the diagonal if desired.
Source: Conrad, 1946.

The factor concept can be illustrated by means of a series of correlation
tables. Table 27 gives correlations of three Navy classification tests with each
other. These data suggest two conclusions.

Because the correlations are generally positive, the tests must be
affected by some common characteristic.
Tests A and B have more in common than either has in common with
test C.

The reasonableness of such a result is clear when we find that A is the
General Classification test, B the Reading test, and C the Arithmetic
Reasoning test. Probably the common element in all three tests is a composite
of general reasoning ability and past learning. Two verbal tests may well have
more in common than either has in common with a mathematical test.


TABLE 28. Intercorrelations of Four Measures for Adult Workers

                            Arithmetic
                            Reasoning     Turning    Assembly
    Vocabulary                 .66          .06        .14
    Arithmetic Reasoning                    .03        .16
    Turning                                            .38
    Assembly

Source: Guide to the Use of GATB, 1958, III, G-11.


Table 28 has a very different pattern of correlations which shows clearly
the presence of two distinct abilities. A verbal-educational ability is found
in the Vocabulary and Arithmetic tests. Some psychomotor ability affects
both the Turning test (placing pegs in holes) and the Assembly test
(assembling a rivet and washer).
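
Inspection of this kind can be mimicked mechanically. The Python sketch below
links any two tests whose correlation reaches an arbitrary cutoff and reports
the resulting groups. The cutoff of .30 and the function name clusters are
assumptions made for illustration, and the procedure is a crude stand-in for
inspection, not a factor analysis.

```python
from itertools import combinations

# Correlations from Table 28 (upper triangle).
r = {
    ("Vocabulary", "Arithmetic Reasoning"): 0.66,
    ("Vocabulary", "Turning"): 0.06,
    ("Vocabulary", "Assembly"): 0.14,
    ("Arithmetic Reasoning", "Turning"): 0.03,
    ("Arithmetic Reasoning", "Assembly"): 0.16,
    ("Turning", "Assembly"): 0.38,
}
tests = ["Vocabulary", "Arithmetic Reasoning", "Turning", "Assembly"]

THRESHOLD = 0.30  # illustrative cutoff, an assumption of this sketch


def clusters(tests, r, threshold):
    """Group tests linked, directly or through a chain, by r >= threshold."""
    group_of = {t: {t} for t in tests}
    for a, b in combinations(tests, 2):
        if r[(a, b)] >= threshold and group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for t in merged:
                group_of[t] = merged
    unique = []
    for t in tests:
        if group_of[t] not in unique:
            unique.append(group_of[t])
    return [sorted(g) for g in unique]


groups = clusters(tests, r, THRESHOLD)
print(groups)
```

With this cutoff the two verbal-educational tests fall in one group and the
two psychomotor tests in another.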

A formal factor analysis goes beyond inspection and calculates how much
each test is influenced by the various factors, as we did by a simple but
inefficient method for the two Wechsler subtests. (For procedures see
Thurstone, 1947.)

1. Table 29 presents correlations between six tests of the Navy classification
battery. Does there appear to be a single common factor among all these tests?
If so, what might be its psychological nature?


TABLE 29. Intercorrelations of Six Navy Classification Tests

                          General                  Arithmetic  Mechanical  Electrical  Mechanical
                          Classification  Reading  Reasoning   Aptitude    Knowledge   Knowledge
General Classification                    .81      .69         .60         .53         .49
Reading                                            .69         .56         .51         .46
Arithmetic Reasoning                                           .61         .47         .41
Mechanical Aptitude                                                        .53         .55
Electrical Knowledge                                                                   .78
Mechanical Knowledge

Source: Conrad, 1946.


2. Which pairs of tests in Table 29 seem to have the greatest overlap? 

The Three Types of Factors 

Three types of factors are commonly distinguished: general, group, and
specific. A specific factor is present in one test but not in any of the others
under study. A group factor is present in more than one test. A general
factor is a factor found in all the tests. If all the correlations among a set of
tests are positive, one can find a general factor. If there are any zero or
negative correlations, a general factor will ordinarily not be found (Figure
43). The mathematical methods of the factor analyst determine the
correlation between each test and each factor. These correlations provide a
table of “factor loadings.” The square of the factor loading tells how much
each factor contributes to the variance of the test (cf. Table 30).
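
As a minimal sketch of this rule, the Python lines below square the loadings
listed in Table 30 and express them as percentages of variance. The variable
names are assumptions introduced here; treating whatever reliable variance is
left after the common factors as unique follows the layout of Table 30.

```python
# Loadings and reliability as given in Table 30.
loadings = {
    "Verbal-educational": 0.35,
    "Mechanical experience": 0.64,
    "Quantitative reasoning": 0.10,
}
reliability = 0.79

# Percentage of variance contributed by each common factor.
pct = {name: round(100 * a ** 2) for name, a in loadings.items()}
communality = sum(pct.values())              # total common-factor variance
error = round(100 * (1 - reliability))       # unreliable variance
unique = 100 - communality - error           # reliable but specific variance

print(pct, communality, error, unique)
```

The unique percentage obtained this way corresponds to a unique-factor
loading equal to its square root, expressed as a proportion.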

Many factorial studies must be completed to arrive at psychological
theory (Ahmavaara, 1957). Just what factors appear and what form they
take depend on what tests are correlated. If we analyze only numerical tests,
there will be a numerical general factor. Put two or three numerical tests
into a mixed collection, and the same ability shows as a group factor. Use
just one numerical test in the battery, and the factor will be specific.

[Figure 43: four possible correlation tables among three variables, each
shown above its corresponding factor pattern: all factors specific; general
and specific factors; group and specific factors; general, group, and
specific factors.]

It is also possible to have general and group, but no specific factors;
group factors alone; or a general factor alone. These are unlikely.

FIG. 43. Possible factorial relations among tests.

3. The Wechsler test can be scored to emphasize information about a general
factor, about group factors, or about specific factors present in various
subtests. Demonstrate the truth of this statement.

4. Confidence may be manifested in a variety of situations: making a speech to a
woman’s club, taking one's car apart to repair it, piloting a jet plane, or going
to a show instead of cramming for a test. Give three alternative explanations
of the nature of confidence: one in which it is considered as a general factor,
one in which it is divided into group factors, and one in which it is considered as
a number of highly specific factors. Which theory do you think is most adequate?

5. Confidence is to be considered in selecting future fighter pilots. How would a
psychologist test confidence for this purpose if he believed it to be a broad
general trait? How would he proceed if he considered confidence to be specific
to a particular situation?

How Factor Analysis Groups Tests. We shall now examine several illustrative
results. Our first example treats the Navy classification tests whose
correlations were presented in Table 29. These tests had various part scores.
Peterson was asked to determine how many different abilities were being
measured, so that testers could report to classification officers all the scores
giving distinct information without reporting the same ability under different
names.

To answer this question, Peterson chose to break up the general factor
among the tests and subtests, and rotated to obtain the “simple structure”
shown in Table 31. There are three group factors, which may be interpreted
by examining the tests where they appear. Factor I in this study might be
given the name Verbal-Educational (frequently designated v:ed in British
reports). Factor II can be called Mechanical Experience. Factor III cannot
be named without more facts about the tests than we have given, though it
seems to involve quantitative reasoning. Note that two tests named
“Mechanical Comprehension” measure different factors.


TABLE 30. Approximate Factor Loadings and Factor Composition of the Navy
Mechanical Comprehension Test

                                        Factor     Percentage of
    Factor                              Loading    Variance
    Verbal-educational                   .35           12
    Mechanical experience                .64           41
    Quantitative reasoning               .10            1
    Total common factors
      (“communality”)                                  54
    Error, if reliability = .79                        21
    Unique                               .50           25
    Total                                             100

These values are presented solely to illustrate the form of a factorial
result. Though they are derived from D. Peterson's (1943) findings (see text),
they do not give an adequate analysis of the mechanical comprehension test.
For more complete results, see Table 31.


TABLE 31. Factor Analysis of Navy Classification Test Scores

                                                              Factor Loading
    Test                         Subdivision                  I     II    III   Specific
    Reading                      Reading                     .70    0     0       X
    General Classification       Opposites                   .76    0     0       X
      (GCT)                      Analogies                   .73    0     0       X
                                 Series Completion           .68    0     X       X
    Arithmetic Reasoning (AR)    Arithmetic Reasoning        .56    0     X       X
    Mechanical Knowledge (MK)    Tool Relations               0    .69    0       X
                                 Mechanical Information       X    .59    0       X
                                 Electrical Comprehension     X    .67    0       X
                                 Mechanical Comprehension     X    .64    0       X
    Mechanical Aptitude (MAT)    Block Counting               0     0    .61     .64
                                 Mechanical Comprehension     0     X    .52      X
                                 Surface Development          X     X     X      .65

X indicates factor loading between .20 and .50; 0 represents negligible
loading, below .20. In this analysis there are small correlations between
factors which the discussion in the text ignores.
Source: D. Peterson, 1943.

Peterson found only three common factors in the twelve tests. In addition,
each test measures some specific ability. Specific-factor loadings are fairly
small except in Block Counting and Surface Development. Thus the analysis
suggests that nearly all the information in the twelve scores can be reported
in five scores: Factor I, Factor II, Factor III, and two specific factors. To
simplify the record of the recruit, GCT, Reading, and AR could perhaps be
combined into a verbal score. MK could be kept separate as a measure of
mechanical experience. It would presumably be valuable to extend MAT to
obtain better measures of the three factors it contains. By not scoring
subtests of GCT and MK, and by pooling similar tests, the tester would
eliminate seven scores from the record a classification interviewer has to
interpret. This condensation, however, would be too drastic.

Factor analysts concentrate on large factor loadings and often ignore
loadings below .50. Looking only at the larger factors would imply that GCT,
Reading, and AR duplicate one another, and that since each test is reliable
the Navy could drop two of them. Many factor analysts would have made
precisely this drastic recommendation. But Peterson did not, and he was
correct.

Specific factors and group factors with loadings below .50 may be of
considerable importance to validity. Table 32 shows how the three tests predict
grades at Great Lakes Naval Training Center. In those training courses
which require arithmetic ability, AR tends to be a better predictor than
the other tests. It would have been a mistake to drop the AR test or to pool
it with other measures of Factor I. Either Factor III or the specific factor in
AR is making an important contribution to validity, even though the
loadings are below .50. In general, while factor loadings suggest how tests may
be grouped, final decisions on design of testing programs should rest on
information about the validity of all factors including the specific ones.


TABLE 32. Validity of Tests Loaded on the v:ed Factor for Predicting
Service-School Grades

                           Arithmetic
    Training Course        Reasoning     GCT     Reading
    Basic engineering         .38        .31       .30
    Electrician's mate        .57        .55       .42
    Fire control              .34        .25       .34
    Quartermaster             .53        .37       .36
    Cooks and bakers          .33        .54       .40
    Storekeeper               .43        .16       .26

Source: Frederiksen and Satter, 1953.


In interpreting this example we may recall that a single factor analysis
identifies the factors in a set of tests, taken as a set. If one of the same tests
were included in a different collection, a somewhat different factor
composition would be found. For example, if numerical tests were added to the
battery, Factor III might divide into a numerical ability (in AR and Block
Counting) and a geometric or spatial ability. We shall see later that the MK
test performs rather differently when analyzed in other batteries.

How Factor Analysis Interprets Tests. The most extensive factorial
interpretation of tests has been done in Air Force research. Hundreds of tests
of all sorts were used to select pilots, navigators, and bombardiers. Analyzing
the correlations among many tests on large samples led to factorial
interpretations such as those graphed in Figure 44.



[Figure 44: composition diagrams for three tests.
Numerical Operations, a nearly pure test (subtraction and division exercises).
Directional Plotting, a mixed test (reporting direction between two points,
when given their coordinates on a chart).
Complex Coordination, a highly complex test (job replica apparatus test).]

FIG. 44. Tests of different factorial purity. Factor loadings were determined
by analysis of a complete battery of AAF classification tests (Guilford, 1947,
pp. 828-831). The most prominent factor in each test is shaded.


The three tests are seen to be quite different in structure. Numerical
Operations contains only one important factor, whereas Directional Plotting
is influenced by four or five factors. Though Directional Plotting may predict
some criterion that demands just this mixture of abilities, it is difficult to
interpret psychologically. Complex Coordination is a test in which the person
handles a stick and rudder in response to light signals. Its largest factor is
one not found in other tests at all. The physical and psychological
complexity of the test is reflected in the variety of factors it involves. This
complexity evidently matches in some way the complexity of the pilot’s job,
for Complex Coordination is one of the best predictors of pilot success. (See
also Figure 2 and pp. 303 ff.)

6. Compute the percentage contribution of each factor to the Reading test
(Table 31), and make a composition diagram for it. Assume a reliability of .85.

7. Make a composition diagram for Block Counting.

“Simple Structure” 

One important step in any factor analysis is that called “rotation.”
Rotation is a procedure for placing the factors so that the results will be most
meaningful. Peterson, for example, decided to eliminate the general factor
which his correlations indicated, breaking up the analysis to emphasize more
diagnostic group factors.

A correlation table describes similarities between one test and all other
tests. The factor analyst introduces artificial variables or “factors” which can
be readily interpreted, and describes the test by its relations to these factors.
The process is like that of describing the location of a home. Jones lives next
to Smith and Adams, half a block from Brown and White, three blocks from
James, Thomas, and Schultz. This description (which resembles a row in
the correlation table) is useless if the person seeking Jones does not know
where these others live, and inconvenient when he does know. So we
introduce a reference system. We locate Jones as north of Main Street and
west of State. Or we say he lives on this side of the highway, across the
railroad tracks, and beyond the ice plant. We can place any home in relation
to these reference lines. All the alternative descriptions are correct, differing
only in completeness and communication value.
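
The point that alternative reference systems describe the same facts can be
checked numerically. The Python sketch below rotates a pair of reference axes
for three tests. The loadings are hypothetical values invented for
illustration, the factors are assumed to be uncorrelated, and the names
rotate, communalities, and reproduced_r are likewise assumptions. Under such
a rotation each test's loadings change, yet its communality and the
correlations implied between tests stay the same.

```python
import math

# Hypothetical loadings of three tests on two uncorrelated factors.
loadings = [(0.6, 0.5), (0.7, 0.3), (0.2, 0.8)]


def rotate(loadings, degrees):
    """Rotate the two reference axes by the given angle."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [(a * c + b * s, -a * s + b * c) for a, b in loadings]


def communalities(loadings):
    """Sum of squared loadings for each test."""
    return [a * a + b * b for a, b in loadings]


def reproduced_r(loadings, i, j):
    """Correlation between tests i and j implied by the loadings."""
    return sum(x * y for x, y in zip(loadings[i], loadings[j]))


rotated = rotate(loadings, 30)
for row in rotated:
    print(tuple(round(x, 2) for x in row))   # loadings on the new axes

for before, after in zip(communalities(loadings), communalities(rotated)):
    print(round(before, 4), round(after, 4))   # each pair agrees
```

Choosing among such equivalent positions is the judgment that rotation
requires; the arithmetic alone does not favor one over another.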
The principle of “simple structure” was suggested by L. L. Thurstone, the
great American pioneer of factorial methods. His scientific aim was to
describe complex performances as composites of simpler performances, i.e.,
to break test scores into more fundamental elements. For example, SB
Memory for Sentences might be described as depending on verbal ability
(three-tenths) and on memory ability (seven-tenths). Thurstone planned
his factor analysis to find group factors having small loadings in some tasks
and large loadings in others. A “simple structure” is one in which a large
number of factor loadings are near zero, so that each test is described in
terms of just a few factors. Thurstone aimed first to track down group
factors, which would have zero loadings in some tests. Second, he aimed to
discover or design “pure” tests each of which would have a high loading
on just one factor. Numerical Operations is one such test. It measures a
skill demanded by many tests and criteria, but it is almost entirely
independent of verbal, reasoning, and other nonnumerical abilities.

British investigators have been less interested in pure measures of simple
abilities. They, instead, rotate so as to identify broad factors present in a
large number of tasks, v:ed being one example.

THURSTONE’S “PRIMARY MENTAL ABILITIES”

The effort to isolate simple abilities is perhaps breaking down today, as we
begin to suspect that no matter how far we advance the number of abilities
still to be isolated always stretches far beyond the horizon. But the rather
simple system of factors which Thurstone proposed in 1938 has had great
influence on all subsequent classification of abilities.

Description of the Factors 

Thurstone (1938) gave 56 tests to students at the University of Chicago
and found six predominant factors: Verbal (V), Number (N), Spatial (S),
Word fluency (W), Memory (M), and Reasoning (R). Subsequent studies
also have found these factors to be useful reference axes, though Reasoning
in particular is treated differently in recent work. Thurstone published a
selected set of relatively pure tests to measure these “primary mental
abilities.” Items from the “PMA tests” are shown in Figure 45. To understand
the Thurstone factors we can examine these items and also SB items known to
have loadings on these factors.

The verbal factor V is found in vocabulary tests, and in tests of
comprehension and reasoning. There are verbal loadings in SB vocabulary,
comprehension, verbal absurdities, and other tests.

The number factor N appears in simple arithmetic tests. Tests of
arithmetic skill are purer measures of N than are tests of arithmetic reasoning.
Giving the number of fingers on one hand, and repeating digits backward
are among the Binet items with loadings on N.

The spatial factor S deals with visual form relationships. Spatial loadings
appear in picture absurdities, copying a diamond, drawing a design from
memory, and paper cutting.

The memorizing factor M appears in tests which call for rapid rote
learning, including memory for words, digits, and designs.

Reasoning, R, appears in tests requiring induction of a rule from several
instances. Reasoning factors appear in the SB plan-of-search test, ingenuity
(water-jar problems), and similarities between concepts.

The word-fluency factor W (which is clearly distinct from V) calls for
ability to think of words rapidly, as in anagrams and rhyming. It is not found
in the Stanford-Binet. The distinction between V and W is shown in two
synonym tests tried by Thurstone. A test requiring the subject to select the
correct synonym from several choices was saturated with V but not W; a test
in which the subject rapidly supplies three synonyms for an easy word
measured W, not V.


V   Today much of our clothing is designed to make a fashionable
    appearance rather than for
        style   protection   children   sale   dresses

    Synonyms: quiet     blue   still   tense   watery

N   Is this addition right or wrong?    42
                                        61
                                        83
                                       176

    Mark every number that is exactly three more than the number just
    before it.
        4 11 14 10 9 12 16 8 10 3

S   Put a mark under every figure which is like the first figure in the row.
    [figures not reproduced]

M   Study associations such as “chair-21” and “box-44.” Mark the correct
    number on a later test.

R   Letter series (Which letter comes next?)
        abxcdxefxghx . . .

    Letter groupings (Which group is different?)
        AAAB   AAAM   AAAR   AATV

W   List as many four-letter words beginning with C as you can.

FIG. 45. Items from the Chicago Tests of Primary Mental Abilities. (Copyright
1941 by L. L. and Thelma Gwinn Thurstone. Reproduced by permission of Mrs.
Thurstone and Science Research Associates.)

Thurstone’s list of primary abilities is a convenient reference system. The
word “primary,” however, suggests that the list is more than a matter of
convenience, that it represents something fundamental about the way the
mind works. This implication raises questions which it has taken twenty
years of research to answer.

Is general ability nothing but a mixture or average of the primary
abilities?

Are these factors the only ones into which these tests could be divided?

Are these factors unitary and indivisible?

Is this a complete list of mental abilities?

Is this factor structure a reflection of innate human nature or of cultural
influence?

To these might be added questions regarding predictive validity of the
factors, but information on that subject will be accumulated through several
succeeding chapters.

8. Do Thurstone's tests cover the same ground as the Kuhlmann-Anderson test?
Can you find Kuhlmann-Anderson items which appear to represent each
Thurstone factor?

9. Which Thurstone factors are most consistent with Binet’s description of
intelligence?

10. What factors would you expect to influence Wechsler Vocabulary scores?
Digit Symbol?

The Status of General Ability 

Thurstone intended by the name “primary abilities” to suggest that these
abilities combine to produce aptitude for any complex intellectual
performance, just as green, red, and blue spotlights can be mingled to produce
any other hue, or white. If this is true, general mental ability is nothing but a
mixture of primaries in some proportion. In sharp opposition is the view of
Galton and Spearman that some persons are endowed with superior general
adaptive ability which might be turned in various directions. This conflict
of views was sharpened when Thurstone found no general intercorrelation
among his Chicago tests. Since he found near-zero correlations between
ability tests, he argued that no general factor exists.

Subsequent research has altered his argument. The low correlations
proved to be due to the very restricted range of the University of Chicago
sample. In less select groups, even Thurstone and his associates found
general intercorrelations. As Burt (1958, p. 5) says:

In nearly every factorial study of cognitive ability, the general factor
commonly accounts for quite 50% of the variance (rather more in the
case of the young child, rather less with older age groups) while each of
the minor factors accounts for only 10% or less. . . . For all practical
purposes, almost every psychologist—even former opponents of the
concept of general intelligence, like Thorndike, Brown, Thomson, and
Thurstone—seems in the end to have come round to much the same
conclusion, even though, for theoretical purposes, each tends to reword
it in a modified terminology of his own.
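
How strongly selection can depress a correlation, as in the restricted
Chicago sample mentioned above, may be sketched with the standard formula for
direct selection on one of two variables. The Python function below is an
illustration under stated assumptions: the relation is linear with equal
scatter throughout the range, selection acts directly on one of the
correlated variables (which was not literally the case for the Chicago
students), and the population correlation of .50 is a hypothetical value. The
name restricted_r and its arguments are introduced here.

```python
import math


def restricted_r(r_full, sd_ratio):
    """Correlation expected in a group selected directly on one variable.

    r_full   -- correlation in the unselected population
    sd_ratio -- SD of the selection variable in the selected group
                divided by its SD in the unselected population
    """
    num = r_full * sd_ratio
    return num / math.sqrt(1 - r_full ** 2 + (r_full * sd_ratio) ** 2)


# Hypothetical population correlation of .50 under increasing selectivity.
for ratio in (1.0, 0.75, 0.5, 0.25):
    print(ratio, round(restricted_r(0.50, ratio), 2))
```

The narrower the range of the selected group, the smaller the correlation
that remains to be observed.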

The issue then reduces to how to take the general factor into account.
Holzinger in America and Burt in England preferred to pay attention to
the general factor first, and then to see what further information group
factors add. Thurstone preferred to concentrate on group factors, and to
account for the overall relation by identifying a “second-order” factor which
unites the groups.

The Determinacy of Factors 

Thurstone’s list has often been regarded as a list of the basic elements of
the human mind. Some persons have compared it to the chemist’s list of
elements. Others, critical of the approach, have condemned it as a new
“faculty” psychology.

Factor analysis is in no sense comparable to the chemist’s search for
elements. There is only one answer to the question, What elements make up
table salt? In factor analysis there are many answers, all equally true but
not equally satisfactory (Guttman, 1955). The factor analyst may be
compared to the photographer trying to picture a building as revealingly as
possible. Wherever he sets his camera, he will lose some information, but by
a skillful choice he will be able to show a large number of important features
of the building.

The fact that many other investigators find similar factors has made it
seem as if Thurstone's list did embody some fundamental truth. Yet his list
does not include anything like the v:ed factor of the British investigators,
and his N factor is defined by simple arithmetic skill rather than by
reasoning. Location of reference factors is a matter of judgment.

Thurstone’s choice of his particular factors was dictated by a criterion of
simplicity. He wanted irreducible factors and therefore matched his factors
to very simple tests wherever he could. A test whose items seemed, on
inspection, to involve many types of mental process would not satisfy him as a
measure of a pure factor. This explains, for example, why N was defined in
terms of elementary, overlearned computational skills.

The meanings of factors shift from time to time as new evidence and new
criteria are introduced. As we shall see, N, R, and S have somewhat different
meanings in current studies from the meanings they had in the 1938 list.

Divisibility of Factors

Thurstone and his students discarded the view that factors are irreducible.
While verbal tests have enough in common to define a verbal factor, they
can be divided into several subgroups, thus establishing narrower factors
within the verbal domain. One can divide a vocabulary test, for instance, into
subgroups of words from different content areas. Within the subfactor of
“science vocabulary,” we would find that some students know more
chemical terms than psychological ones. This could be pursued down to
ridiculously fine detail. Other factors subdivide similarly.

Factor analysts now recognize that abilities are most clearly described
by a hierarchy ranging from the very broad factors to those present only in
very specific tests. One can plan his statistical analysis to find only the
high-level factors, to find only factors of intermediate breadth, or to isolate
dozens of detailed factors. Many investigators have suggested possible
hierarchical arrangements, but all the proposals are tentative at present,
subject to verification by further data. Vernon’s diagram (Figure 46) is one
such suggestion. (For an actual factor analysis deriving hierarchical factors
at three levels, see Moursy, 1952, and Laugier, 1955, pp. 187-208.)

[Figure 46: tree diagram of a hierarchy of abilities. Minor group factors
shown include Verbal (v), Number (n), and, under Practical (k:m),
Mechanical Information, Spatial, and Manual.]

FIG. 46. Sketch of a possible hierarchy of abilities. (After Vernon, 1950,
pp. 22-23.)


Completeness of the List 

The preceding remarks have already indicated that the number of possible
factors is inexhaustible, if we are willing to make the factors sufficiently
trivial. The question remains, however, whether significant factors can be
discovered beyond Thurstone’s list. The answer is emphatically “yes.” These
remarks were written at the end of World War II (F. B. Davis, 1947, p. 59):

The results of testing hundreds of thousands of men in the armed
forces and of analyzing these data suggest to many psychologists that
the number of basic mental abilities may often have been
underestimated. From factorial analyses of many different matrices of
intercorrelations obtained as a result of testing aviation cadets in AAF
classification centers, factors that have been mathematically determined
have been named as indicated in the following list.

    Carefulness
    General reasoning I
    Integration I
    Integration II
    Integration III
    Judgment
    Kinesthetic motor
    Length estimation
    Mathematical background
    Mathematical reasoning
    Mechanical experience
    Memory I
    Memory II
    Memory III
    Numerical
    Perceptual speed
    Pilot interest
    Planning
    Psychomotor coordination
    Psychomotor precision
    Psychomotor speed
    Reasoning II
    Reasoning III
    Social science background
    Spatial Relations I
    Spatial Relations II
    Spatial Relations III
    Verbal
    Visualization

There is no objective method of determining whether the names
attached to the factors discovered in the analyses are accurate
descriptions of the mental abilities represented by the factors. In any
case, . . . the number of basic mental abilities may be much larger than was
formerly believed.

Some of these added factors came from the extension of factorial
investigations to psychomotor tests. Some came from bringing new
pencil-and-paper tests into the analysis. Some came as a result of
subdivisions—but not trivial subdivisions—of the Thurstone factors.

One gets out of a factor analysis only what he puts in. This remark has
become trite, but it is of basic importance. Factor analysis sorts the abilities
present in the test battery; it does not unearth new ones. Thurstone identified
the common elements in tests such as psychologists had been generally
using. If psychologists had not yet designed tests covering some important
ability, that ability could not show in Thurstone’s list. The Air Force invented
and tried out many additional possibilities but by no means covered the
range of possible ability tests.


Origin of Factors 

Many of those who believe that factor analysis is identifying “the way
human abilities are organized” think that biological nature determines what
factors are found. It is conceivable that perceptual speed and spatial
judgment rely on different neural processes, and that a person could be
superior on one process and not the other. It is equally possible to argue that
correlations between abilities are produced by experience. Numerical
performances develop together, presumably because they are taught together.
A child who makes a bad start in arithmetic because of poor teaching will
lag in all numerical tasks even though he may do well in verbal work. There
is no need to conclude that the separation of numerical from verbal facility
is inborn. On the other hand, we may yet find intellectual patterns of
undeniable hereditary origin. Many sensory differences (e.g., color vision) are
of this character.

11. One student says, “It seems to me the factor analysts are like astronomers
trying to discover planets. The astronomer finds a new planet by detecting the
pull it exerts on already known bodies. Then he makes more careful studies to
check his conclusion and locate the planet exactly. The factor analyst locates
one test against already established abilities.” How satisfactory is this
comparison?

12. Another student suggests that factors are comparable to constellations of stars,
which the astronomer uses to label portions of the sky (e.g., “The nebula is in
Orion”). How apt is this comparison?

THE PRESENT STATUS OF FACTOR ANALYSIS 

From some points of view factor analysis has been a great success. It
provides precise methods for handling large numbers of variables and for
reducing them to a much smaller number of scores with little loss of
information. Thus factor analysis is a highly important statistical method.
Secondly, factor analysis has cut through a large amount of nonsensical
interpretation which results from assuming that every test with a different
name measures a different ability. Thirdly, factor analysis helps to describe
what a test measures. It is gradually establishing a reference system that all
psychologists can use to describe tests.

Some critics of factor studies were disappointed when they found that not
all factors measured practically significant mental abilities. Even one of the
pioneers in the field (Kelley, 1939) spoke of the discoveries as “mental
factors of no importance.” Probably the correct position to take is that factor
studies clarify what present tests measure. They cannot identify factors not
built into the original tests. They cannot guarantee to produce factors of
practical importance. But by clarifying the content of tests they permit the
psychologist to decide whether he is satisfied with them and help him to
throw out the components that are useless. Furthermore, the sorting of
abilities directs research to the question, For what is each of these human
talents useful, and how can we capitalize on it?

The great goal of the factor analyst has been to discover a dependable list
of the important abilities. Many abilities have been defined, and tests which
give fairly pure measures for most of these factors are available. R. B.
Cattell (Laugier, 1955, pp. 319-325) has proposed that an “international
index” of well-established factors be prepared, but the other leading factor
analysts argue that the definitions of factors are still shifting and that to
“freeze” any present list would be premature. The difficulty is not that factor
analysis fails to analyze data correctly. The problem is that, with so many
different ways of placing reference factors, it is unlikely that the best
possible system has been found.

Many investigators feel that factor analysis has paid undue attention to
the content of the test item, i.e., to whether it deals with words, numbers,
forms, or other symbols. Content groupings are of course to be found in tests
using different content. A more fundamental problem, however, is the
organization of mental process. Thurstone distinguished three such processes:
memory, reasoning, and fluency. Meili (1946, 1955) found evidence for
fluency, complexity (application of such ideas as order), plasticity
(restructuring, as in Block Design or Hidden Figures), and integration
(titling a picture). These may be subdivisions of Thurstone’s reasoning
factor. Attempts to study process are growing in number but are not near to
final results.

A tentative three-way organization of intellectual tasks has recently been
suggested by J. P. Guilford, as an outcome of a long series of studies of
high-level intellectual performance. Guilford (1957) distinguishes five types
of mental operation:

Memory—retention of information 

Cognition—recognizing patterns, facts, etc 

Convergent thinking—proceeding from information to a specific “right
answer”

Divergent thinking—proceeding from information to a variety of adequate
solutions (as in finding titles to fit a plot)

Evaluation—decisions concerning goodness or appropriateness of ideas (e.g.,
judging which problems are significant)

Tasks within each of these categories can be classified with respect to
“content” and “product.” The content categories are figural (directly
perceived objects, events, drawings, etc.), symbolic (letters, numbers, etc.),
semantic (verbal), and “behavioral” (interpretation of human behavior).
The six kinds of “products” distinguished by Guilford are units of
information, classes of units, relations between units, systems of
information, transformations, and implications.

Since there are five operations, four content categories, and six
products, there are 120 different combinations. Each combination represents a
type of task which is or can be represented in an intellectual test, according
to Guilford. For example, the common verbal comprehension tests fit into
cognition of semantic units of information. A test which asks the subject to
find the name of a sport concealed in the sentence “He chose a Mongol for
his bride” is classified as convergent thinking—symbolic—transformation.
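The count of 120 follows from multiplying the three classifications. A minimal Python sketch, using the category names listed above (the variable names are illustrative and are not Guilford's notation), enumerates the cells and checks that the two examples just mentioned fall inside the grid:

```python
from itertools import product

# Category labels as listed in the text; the variable names are illustrative.
OPERATIONS = ["memory", "cognition", "convergent thinking",
              "divergent thinking", "evaluation"]
CONTENTS = ["figural", "symbolic", "semantic", "behavioral"]
PRODUCTS = ["units", "classes", "relations", "systems",
            "transformations", "implications"]

# Every (operation, content, product) triple is one cell of the system.
cells = list(product(OPERATIONS, CONTENTS, PRODUCTS))

print(len(cells))  # 5 * 4 * 6 = 120

# The two examples from the text are ordinary cells of this grid.
print(("cognition", "semantic", "units") in cells)                      # True
print(("convergent thinking", "symbolic", "transformations") in cells)  # True
```

The enumeration says nothing about whether a test exists for a given cell; it only shows how the three-way classification generates the number quoted.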

Guilford’s system is still undergoing development, and it is not yet clear
how well his categories reproduce the empirical relations found through
factor analysis. The system emphasizes the distinction between test content
and test process and is therefore an advance over the Thurstone
explorations in which processes (fluency, reasoning) were mixed
indiscriminately with content (verbal, spatial) factors.

The striking thing about Guilford’s system, apart from its bold break with
tradition, is the vast number of ability factors he requires. His system has
over 120 cells, of which perhaps 50 have been matched with tests. It
begins to be clear that we will never again have a list of a few simple
primary abilities. According to Guilford (1957, p. 20):

The obvious implication for intelligence testing is that the trend
toward the multiple-score approach and the enlightened composite-
score approach should be accelerated. The single, somewhat
haphazardly composed, score has worked well, perhaps too well, hence the
unwarranted complacency regarding it. It would seem that we now
have information that should make possible a considerable advance in
refinements of measurement of intelligence. If the apparent complexity
implied is appalling, what seems to be needed is the courage to face
reality. If the next steps do not seem to be clear, then the cure is more
knowledge—knowledge concerning the whole list of intellectual factors,
their relations to complex mental functioning, and their relations to
everyday behavior.

13. Compare Meili's four factors to Guilford's major factors.

14. Where, in Guilford's system, do Thurstone's V, S, and W factors appear? 


A FACTOR ANALYSIS OF THE WECHSLER SCALE 

As a final example of factor analysis in test interpretation, we turn to a study
of the Wechsler subtests (P. C. Davis, 1956). This analysis illustrates the
modern technique of using “reference tests” to define factors.

Davis gave Form I of the Wechsler-Bellevue (very similar to WAIS) to
202 eighth-graders in Seattle. He wanted to learn as much as possible about
subtest meanings and believed that group factors would be found if the
obvious general factor among the subtests was broken up. He predicted the
presence of particular factors, and for each such factor he introduced one
or two reference tests into his battery to aid in rotation and interpretation.
A reference test is one regarded as a fairly pure measure of a factor. The
predicted factors and their reference tests were as follows:

Verbal comprehension. Two tests (Nos. 9 and 10 in Table 33). One
called for choice of a correct synonym; one called for writing of
definitions.

Numerical facility. Numerical Operations test (4; see Figure 44) calls
for simple addition, division, etc., under time pressure.

Perceptual speed. Test (3) requires the pupil to match a figure of
strange form with whichever of five other figures is exactly like it.

Visualization. Mechanical principles (6). This test, adapted from Air
Force research, resembles the Bennett TMC. Visualization is a factor
commonly identified in spatial tests (see p. 279), requiring understanding of
movements of objects. Test 6 is influenced by both visualization and
mechanical experience (Guilford, 1947, p. 894) and therefore is a
rather poor reference test.

Arithmetic Reasoning (5). Verbally stated problems making little
demand on computation are administered as a highly speeded test.

Mechanical information. The test (7) asks about a variety of tools and
machinery.

In addition, Davis included other conveniently available scores: age, an
Otis group test, and a test of current scientific information. Finally, he
adapted parallel forms of three Wechsler subtests for group administration.

The correlations were almost all positive. Davis found ten factors,
although he had suggested six initially. He rotated them to obtain a simple
structure, that is, a pattern in which each test is loaded on few factors, and
each factor is found in only a limited group of tests. The factors are listed in
Table 33. (All loadings lower than .30 are omitted to reduce confusion.)
Specific factor loadings for the Wechsler tests were calculated by the
present writer, using a rough estimate of error variance. (For the other tests,
error and specific factors cannot be separated because their reliabilities are
unknown.)
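The text does not spell out how the specific loadings were obtained. One conventional partition, offered here only as a sketch under the assumption of uncorrelated factors, treats a test's unit variance as communality plus specific variance plus error, with reliability equal to the first two parts. The function name and the figures below are hypothetical and are not taken from Table 33:

```python
import math

def partition_variance(common_loadings, reliability):
    """Split unit test variance into common, specific, and error parts.

    Assumes uncorrelated (orthogonal) factors, so squared loadings add.
    """
    communality = sum(a * a for a in common_loadings)
    specific_var = reliability - communality
    if specific_var < 0:
        raise ValueError("communality exceeds the assumed reliability")
    return {
        "communality": communality,
        "specific_variance": specific_var,
        "specific_loading": math.sqrt(specific_var),
        "error_variance": 1.0 - reliability,
    }

# Hypothetical test: loadings of .60 and .40 on two common factors,
# reliability assumed to be .85 (none of these figures come from Table 33).
parts = partition_variance([0.60, 0.40], reliability=0.85)
for name, value in parts.items():
    print(f"{name}: {value:.2f}")
```

When the reliability is unknown, the last two parts cannot be told apart, which is the point made in the parenthetical remark about the non-Wechsler tests.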

Before going further, let us note that this is a factor analysis of the
Wechsler, not the only possible analysis. A somewhat different structure would
occur in a sample of a different kind. A different investigator might choose a
slightly different rotation. Indeed, a different choice of reference tests
would introduce some other factors. But these differences should be
relatively minor in view of Davis’ large sample and large number of measures.
We would interpret factors as follows.¹

¹ The writer has given some of the factors names different from Davis’, for the sake of
coordination with other analyses described in this book.

TABLE 33. Factor Analysis of Wechsler Scores with Reference Variables

Variable                      Loadings of .30 or more, in column order
 1. Age                       -.32  -.40
 2. Otis Beta                  .54   .30   .47
 3. Perceptual Speed           .30   .52*
 4. Numerical Operations       .74*  .31
 5. Arithmetic Reasoning       .44*  .32
 6. Mechanical Principles      .39   .64*
 7. Mechanical Information     .44*
 8. Science Information        .57   .32
 9. Synonyms                   .75*  .31
10. Word Definition            .80*  .37
11. Information—group          .61   .44
12. Comprehension—group        .38   .62
13. Similarities—group         .32   .44   .38   .65
14. W Information              .42   .56   .46
15. W Comprehension            .33   .62
16. W Digit Span               .34   .37   .52
17. W Arithmetic               .57   .34   .36   .32
18. W Similarities             .31   .65   .54
19. W Vocabulary               .60   .45
20. W Picture Arrangement      .49   .47
21. W Picture Completion       .38   .58
22. W Block Design             .41   .30   .38   .44   .40
23. W Object Assembly          .42   .34   .56
24. W Digit Symbol             .52   .37   .30   .61

Source: Based on unpublished data supplied by Paul C. Davis. Asterisks indicate reference tests.


V—The first hypothesized factor, verbal comprehension, was defined by
reference tests 9 and 10. Wechsler Vocabulary and Otis Beta are
especially loaded with this factor. Three of the Wechsler “Verbal” tests,
however, have loadings below .30.

VPS—This factor appears in both of the Arithmetic Reasoning tests,
in Comprehension, and in Group Similarities. It might be titled verbal
problem solving.

NV—Nearly all the tests requiring thinking about numbers or objects
have moderate loadings on this factor. The common factor seems to
involve some sort of nonverbal reasoning. This factor appeared instead of
the hypothesized factor of mechanical information. The test of
mechanical information did not have much in common with the Wechsler tests.

N—This factor has only small loadings on tests other than the
reference test, Numerical Operations. The high loading of Digit Symbol must
be due to the marked speeding of both tests, rather than to the fact that
both tests involve numbers. We call the factor numerical speed.

I—This is an information or general education factor.

P—Perceptual speed is identified by the reference test. It appears in
three of the Wechsler Performance tests. Loadings in two unspeeded
verbal tests (9 and 13) seem to contradict the interpretation, but the
loadings might result from sampling error.

Vz—This factor is present in the mechanical comprehension test,
Picture Completion, Block Design, and Object Assembly. It can be only
vaguely interpreted as some sort of spatial or visualization ability.

E—This is a factor defined only by similarity items. If only one of the
two Similarities tests had been used, it would have been a specific
factor. Here is an example of changing a specific factor to a “group factor” by
bringing a similar test into the battery.

F—This is found in Group Comprehension, Digit Span, and Block
Design. It is a minor factor having no reference test, and any interpretation
of it is speculative.

?—This unanticipated factor links Picture Arrangement, Word
Definition, Otis Beta, and Age. It is an uninterpretable factor having
something to do with reasoning or education.

All Wechsler tests save Arithmetic have specific loadings over .30. The
notable specific factors are found in Comprehension, Picture Completion,
Object Assembly, and Digit Symbol.

What, now, have we learned about the Wechsler test?

• That if we break up the Verbal, Performance, and Full scale scores,
we can find a large number of different abilities within the test. There is no
reason to think that Davis’ ten factors plus four unique factors constitute
the most refined subdivision possible.

• That Wechsler subtests very rarely correspond to psychologically
simple abilities. No Wechsler test is anywhere near to a pure measure of a
commonly accepted reference factor.

• That it appears possible to estimate individual scores on some factors
from appropriate combinations of Wechsler subtests. One could obtain
moderately dependable measures of Verbal Comprehension, Visualization,
Numerical Speed, Perceptual Speed, and Verbal Problem Solving as distinct
abilities. The other factors are not reliably measured by the Wechsler
subtests.

• That the Wechsler scores include a good deal of information about
individual differences not observed in the group tests. The Performance
subtests in particular resisted description in terms of factors common to
other tests. Moreover, the subtests are to an appreciable degree distinct
from each other.

What facts about factor analysis are illustrated in this study?

• That a good deal of variance, probably representing complex
integrative processes, usually remains in specific factors.

• That different interpreters may disagree as to the psychological
meaning of a factor, but that such disagreement is reduced when a factor is
marked by a well-understood reference test.

• That the minor factors in a study are usually difficult to interpret. To
delineate them clearly it is necessary to include reference variables for
these factors in a further study.

15. What is the factorial composition of Otis Beta, assuming a reliability of .90? 

16. What does the factor analysis suggest as to the best subtests to use in a short
form of the Wechsler?

17. Davis suggests that if only five tests are used in the Verbal scale,
Comprehension might well be omitted. His reason is that it has low loadings on
numerous factors. What reasons might justify keeping it in the scale?

18. Davis names factor VPS “general reasoning.” Why is this name open to
question?


Suggested Readings

Schutz, Richard E. Patterns of personal problems of adolescent girls. J. educ.
Psychol., 1958, 49, 1-5.

A factor analysis of a personality questionnaire shows how factorial results
lead to a different organization of the scores from that arrived at by classifying
items according to apparent content.

Vernon, Philip E. Mental faculties and factors, and Landmarks in the development
of factor analysis. The structure of human abilities. New York: Wiley, 1950.
Pp. 1-24.

Vernon shows by simple calculations how a factor analysis is performed,
warns against common misinterpretations, and reviews several of the most
important analyses of abilities.


Differential Abilities in Guidance


THURSTONE intended his Tests of Primary Mental Abilities to be used in
guidance, hoping that the person’s pattern of abilities would indicate the
courses and jobs where he could expect greatest success. We are now ready
to examine how validly such patterns can be interpreted, drawing on
evidence regarding the differential batteries whose validity has been most
thoroughly tested. We begin by describing two of them, the Differential
Aptitude Tests (DAT) and the General Aptitude Test Battery (GATB).

THE DIFFERENTIAL APTITUDE TESTS

The DAT battery was published in 1947, primarily for high-school
counseling. The eight tests measure aptitudes which previous research had
suggested as important in guidance. Among the tests are a modification of the
TMC, a clerical aptitude test, a spelling test, and a verbal reasoning test. This
partial list makes it clear that the DAT is quite different from the PMA
battery. No attempt is made to isolate simple, pure abilities. Instead, the tests
aim to measure complex abilities which have a fairly direct relation to job
families and curricula. Measures of proficiency are included because of their
predictive value.

The tests require six to thirty minutes of working time. With the addition
of time for directions, three sessions of eighty minutes each are required for
the battery. Except for the Clerical test, the tests are essentially unspeeded.
Items from each of the tests except Mechanical Reasoning and Verbal
Reasoning are presented in Figure 47. (For MR and VR, see pp. 40,
235.)

The publication of this integrated collection marked an important forward
step in aptitude testing. The counselor desiring tests of this nature
previously had had to make up his own collection, using tests standardized and
validated on different samples. Interpretation of profiles was therefore
inexact at best. Percentile conversions for all DAT scores have been
calculated on the same sample, so that profile shapes are meaningful. Moreover,
the tests have been matched in difficulty so that all of them can be applied
satisfactorily to the same subjects.

NUMERICAL ABILITY.

    1. Add      393        Answer   A  7908
               4658                 B  8608
               3790                 C  8898
                 67                 D  8908
                                    E  none of these

ABSTRACT REASONING. Which figure is next in the series?

    [Figure items not reproduced.]

CLERICAL SPEED AND ACCURACY. Underline at right the symbol which is
also underlined at left.

    AB  AC  AD  AE  AF        AC  AE  AF  AB  AD
    aA  aB  BA  Ba  Bb        BA  Ba  Bb  aA  aB
    A7  7A  B7  7B  AB        7B  B7  AB  7A  A7

SPELLING. Which words are incorrectly spelled?

    apointed    commission    visinity

SENTENCES. Which parts of the sentence are incorrect in grammar,
punctuation, or spelling?

    Ain't we / going to the / office / next week / at all
       A           B            C         D          E

    They / nearly were / starved / before they landed / somewheres in Florida
     A         B            C              D                    E

FIG. 47. Items from subtests of the Differential Aptitude Battery. (Items
copyright 1947, The Psychological Corporation. Reproduced by permission.)

The intercorrelations and reliabilities of the tests are presented in
Table 34. Reliabilities are split-half coefficients, except for the speeded
Clerical test, where a between-forms coefficient was used. These data are for
ninth-grade boys. It is evident that the tests measure with adequate precision.
Second, we may note that the tests, except for Clerical, involve a general
factor. Third, and of great importance, the correlations between tests are
much lower than their reliabilities. This assures that each test is independent
of the others to a substantial degree.

TABLE 34. Intercorrelations and Reliabilities of DAT Scores

             VR     NA     AR     SR     MR   Clerical  Spelling  Sentences
VR          .88
NA          .50    .88
AR          .51    .49    .86
SR          .35    .35    .49    .92
MR          .44    .25    .48    .43    .85
Clerical    .10    .08    .10    .05    .04     .83
Spelling    .48    .36    .25    .14    .16     .14       .92
Sentences   .53    .43    .36    .23    .26     .11       .59       .86

Reliabilities are the diagonal entries.
Source: Bennett et al., 1947, pp. C-5, C-10.

In order to emphasize the concept of multiple abilities, as distinct from the
single composite ability commonly measured in previous tests, the DAT
originally provided no total or general score. The authors later responded to
the counselor’s demand for an overall predictor by developing norms for the
combination VR + NA. This composite serves the same purpose as the
group tests of general ability or scholastic aptitude in common use.
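The claim that correlations well below the reliabilities leave each test substantially independent can be made concrete. The sketch below takes the squared correlation as the shared proportion of variance, which is one convention among several and is an assumption here rather than the book's stated method. The function name is hypothetical, and the figures are the Verbal Reasoning and Numerical Ability entries of Table 34:

```python
def overlap(r_xy, rel_x, rel_y):
    """Rough variance breakdown for two tests from their correlation
    and reliabilities (all values are proportions of score variance)."""
    shared = r_xy ** 2
    return {
        "shared": shared,
        "x_unique_reliable": rel_x - shared,
        "y_unique_reliable": rel_y - shared,
        "x_error": 1.0 - rel_x,
        "y_error": 1.0 - rel_y,
    }

# Verbal Reasoning and Numerical Ability, values read from Table 34.
vr_na = overlap(r_xy=0.50, rel_x=0.88, rel_y=0.88)
for part, share in vr_na.items():
    print(f"{part}: {share:.2f}")
```

On this reckoning roughly a quarter of each score's variance is shared with the other test, while well over half is reliable variance that the other test does not carry.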

1. The manual suggests that the DAT may be given in two, three, or six sessions,
adjusting the length of session appropriately. Which arrangement would you
consider wisest?

2. Prepare a composition diagram like Figure 37 to show the breakdown into
common and independent elements of these pairs of tests:

a. Verbal-Abstract
b. Numerical-Clerical

3. If a person being counseled has been tested with the Wechsler, which of the
Differential Aptitude Tests would add the most useful supplementary
information?

4. In what high-school subjects would you expect the Space Relations score to
predict success better than the Abstract Reasoning score?


THE GENERAL APTITUDE TEST BATTERY

In marked contrast to the DAT in form and function is the GATB. This
battery was produced by the U.S. Employment Service and is used throughout
the country for guiding persons seeking work. The construction of the
battery was strongly influenced both by Thurstone’s factor-analytic studies and
by three decades of research on job performance. Several of the tests are
descended from the pioneer Minnesota series of vocational aptitude tests,
which date back to the 1920’s.

The USES tests are given only through state employment services. The
tests are often given to high-school juniors and seniors under a cooperative
plan which makes the results available to both the high-school counselor
and the employment service. Versions of the tests are now being prepared
in at least 27 foreign countries.

The employment services are primarily concerned with guiding the person
into suitable work. There are thousands of jobs in the modern industrial
world, each having its own aptitude requirements. When an employer asks
for referrals of potential employees, he wants applicants who are likely to
succeed. The USES, working with state agencies, therefore conducts studies
of the psychological characteristics of particular jobs and accumulates
information on the meaning of test scores. Dvorak (1956) mentions the
following occupations having been studied during a single year: assembler of
dry-cell batteries, aircraft electrician, teacher, X-ray technician, nurse aid,
sheet-metal worker, baker, cook, spot welder, comptometer operator,
corn-husking-machine operator, knitting-machine fixer, and fruit packer.
Prediction for such jobs takes us far beyond the academic and reasoning abilities
which predominate in the tests studied so far in this book.

The diversity of occupations rules out the possibility of devising a separate
aptitude test for each job. At one time, the USES had started to build
different tests for each job family, but when the total number of tests passed
100, it became clear that such a collection could not be used for guidance,
however suitable any of the separate tests might be for screening applicants
for one job. For guidance we need a limited number of diversified tests
which can be given to everyone and which can be linked together in various
combinations to predict success in different situations. With this end in view,
the current form of the GATB uses eight pencil-paper and four apparatus
tests to measure nine distinct factors.

G—General reasoning ability (a composite of tests titled Vocabulary,
Three-Dimensional Space, and Arithmetic Reasoning)

V—Verbal aptitude (Vocabulary)

N—Numerical aptitude (Computation, Arithmetic Reasoning)

S—Spatial aptitude (Three-Dimensional Space)

P—Form perception (Tool Matching, Form Matching)

Q—Clerical perception (Name Comparison)

K—Motor coordination (Mark Making)

F—Finger dexterity (Assemble, Disassemble)

M—Manual dexterity (Place, Turn)

(An earlier form had an additional measure of Eye-Hand Coordination or
Aiming [A], and factor K was referred to as T.)

We can skip over Vocabulary, Arithmetic Reasoning, and Computation
without further description. The Space test is much like the DAT spatial
test. Name Comparison, like DAT-Clerical, requires quick checking to detect
discrepancies between two lists. The USES version gives two lists of names
of business firms, identical except for errors of style and spelling. This
technique of name comparison was invented for the Minnesota Clerical Aptitude
Test, one of the earliest successful special aptitude tests.

Tool Matching calls for rapid visual comparison of pictures of tools,
alike save for differences in shading. The only reason for showing tools
rather than abstract forms is to increase the subject's interest. Form
Matching is a pencil-paper adaptation of a formboard used in the Minnesota
studies in which dozens of irregular shapes were cut out of a board. The subject
was to fit each shape into the correct hole (see Figure 48). In the USES
test, the shapes are printed in two different arrangements, and the subject
must match identical forms. The test appears much like Figure 48, save
that the shapes are larger. Changing from a formboard to a printed test
undoubtedly simplified the factor composition of the test by eliminating
dexterity from the score, and so made it more interpretable as well as easier to
administer.

FIG. 48. Minnesota Spatial Relations Formboard. (Courtesy Educational Test Bureau.)

Mark Making, a psychomotor test, is likewise designed to meet the needs
of a program which tests a million people each year. The subject is asked
only to make a prescribed mark in each square, filling as many squares as
he can in sixty seconds.

The Place and Turn tests are derived from the Minnesota Rate of
Manipulation Test. Forty-eight pegs are placed in a pegboard. A second board with
rows of holes is provided, and the subject transfers the pegs from one board
to another as fast as possible. In the Turn test, he inverts each peg while
transferring it.

The tests named Assemble and Disassemble call for finer coordination,
using both hands. A board contains fifty holes. The person is to fit a rivet and
washer into each hole. In Disassemble, he replaces the rivets in their original
bin and puts the washers onto the rod where they are stored.


TABLE 35. Intercorrelations and Reliabilities of GATB Scores for High-
School Seniors

                           G     V     N     S     P     Q     K     F     M
G—General                 .85
V—Verbal                   —   .86
N—Numerical                —   .42   .82
S—Spatial                  —   .40   .34   .81
P—Form Perception         .43   .34   .42   .48   .72
Q—Clerical Perception     .35   .29   .42   .26   .66   .74
K—Motor Coordination     -.04   .13   .06  -.03   .29   .29   .76
F—Finger Dexterity       -.05  -.03  -.03   .01   .27   .20   .37   .65
M—Manual Dexterity       -.06   .06   .01  -.03   .23   .16   .49   .46   .73

Reliabilities are the diagonal entries. All reliabilities are based on retests
after three months for about 1000 boys. Intercorrelations are for a sample of
100 boys and girls. No correlation is given for G with V, N, or S, since
these tests are included within G.

Source: Guide to the Use of GATB, 1958.

The GATB is designed with an efficiency that has never been exceeded.
The working times for pencil-paper tests are close to six minutes each. The
psychomotor tests require even less working time, but several minutes are
used for demonstration and practice. The entire battery can be given in two
and one-quarter hours. The procedures are simple enough to allow
trustworthy administration of the tests by relatively untrained testers to subjects
who have limited education or poor command of English. The psychomotor
tests are so designed that each subject leaves all the materials as he found
them, ready for the next subject. No doubt much has been sacrificed for
efficient administration. The marked speeding of nearly all the GATB subtests
may reduce their validity for many purposes. One cannot expect to measure
with the precision of the DAT, using subtests only one-fifth as long.

With its access to workers in all areas of the country, all types of industry
and agriculture, and most occupational levels, the USES was able to obtain
a highly representative normative sample. Four thousand cases were drawn
from the records on hand to form a group in which all occupational, sex, and
age groups were properly represented in proportion to census data. Scores
on the factors are expressed in standard-score form, with a mean of 100 and
s.d. of 20.
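A standard score with mean 100 and s.d. 20 can be illustrated with a simple linear conversion. This is a sketch only: the text does not say how the USES conversion tables were built, so the linear form, the function name, and the raw-score norms below are assumptions made for illustration.

```python
def to_standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=20.0):
    """Linearly convert a raw score to a scale with mean 100 and s.d. 20."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd   # distance from the norm mean in s.d. units
    return scale_mean + scale_sd * z

# Hypothetical norms: raw mean 42, raw s.d. 8 (not taken from the GATB manual).
print(to_standard_score(50, norm_mean=42, norm_sd=8))  # 120.0, one s.d. above the mean
print(to_standard_score(42, norm_mean=42, norm_sd=8))  # 100.0, exactly at the mean
```

Expressing every factor on the same scale is what allows scores on different aptitudes to be compared within one person's profile.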

Correlational data for the GATB are presented in Table 35. (These data
are selected from several tables in the technical manual for the test and are
not based on the same sample.) We note the usual common factor running
through the pencil-paper tests, and another factor linking the psychomotor
tests. In general, the test intercorrelations are low enough to give some
promise of meaningful separation of aptitudes.
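The earlier remark that subtests one-fifth as long cannot be expected to match the precision of the DAT can be illustrated with the Spearman-Brown formula, which predicts reliability when a test is lengthened or shortened with comparable material. The sketch below uses a hypothetical starting reliability; the formula assumes the added or removed parts are parallel, and it is doubtful for highly speeded tests such as most GATB subtests.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

# Hypothetical test with reliability .90, cut to one-fifth of its length.
shortened = spearman_brown(0.90, 1 / 5)
print(round(shortened, 2))  # 0.64

# Lengthening the short form fivefold recovers the starting value.
print(round(spearman_brown(shortened, 5), 2))  # 0.9
```

The same function can be run in the other direction, with a length factor above 1, to project the effect of extending a short test.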

5. Compare the reliabilities of the DAT and GATB. How much was sacrificed by the
use of short tests in GATB? What would the reliability of the S score be if the
test were extended from six minutes to thirty minutes?

6. How do you account for the overlap of scores P and Q, which seem to involve
neither reasoning nor dexterity, with the remainder of the battery?

7. Are local norms or national norms most relevant in occupational guidance?

8. The median coefficient of stability for GATB for high-school students is .81, but
for adult applicants at employment service offices it is .89. Account for this
difference. (Time between tests is short in both cases.)

9. Table 36 indicates stability of GATB scores over intervals of several years.
Which aptitudes are stable enough to be used confidently for ninth-grade
counseling? Which aptitudes appear to stabilize late in high school?


TABLE 36. Stability of GATB Scores

                          Correlation with 12th-Grade Scores
                              of Tests Given in Grade
                            8        9        10       11
                          N = 53   N = 61   N = 61   N = 53
G—General                  .75      .82      .80      .84
V—Verbal                   .70      .76      .73      .82
N—Numerical                .76      .77      .81      .85
S—Spatial                  .76      .86      .86      .88
P—Form Perception          .61      .65      .71      .75
Q—Clerical Perception      .77      .80      .86      .89
A—Aiming                   .55      .58      .69      .64
T—Motor Speed              .59      .61      .78      .75
F—Finger Dexterity         .59      .66      .68      .72
M—Manual Dexterity         .65      .65      .71      .73

Source: Unpublished results supplied by Dr. Beatrice Dvorak.

Relation of DAT to GATB 

One study has applied both DAT and GATB to the same high-school
seniors, and the intercorrelations (Table 37) shed light on both tests. Each DAT
score has its highest correlation with the corresponding GATB factor, except
that DAT-VR and DAT-NA have higher correlations with GATB-General
than with GATB-V and -N. The general factor has substantial influence in
every DAT score except Clerical, which correlates with all the GATB speed
tests.

TABLE 37. Correlation of DAT and GATB Scores

                                     GATB Scores
                G       V       N       S       P       Q       T       F       M
                                              Form    Cler.   Motor   Finger  Manual
DAT Scores   General Verbal  Number  Spatial  Perc.   Perc.   Speed   Dext.   Dext.
Verbal         .78     .72     .54     .54     .21     .41     .29     .20    -.03
Spelling       .66     .66     .57     .21     .03     .51     .32     .08     .10
Sentences      .74     .75     .56     .36     .05     .33     .33     .17     .12
Numerical      .66     .52     .62     .32     .01     .22     .27     .13     .05
Abstract       .68     .48     .45     .56     .14     .26     .21     .17     .00
Space          .59     .49     .24     .72     .21     .22     .19     .35     .11
Mechanical     .62     .56     .25     .68     .13     .09     .24     .39     .08
Clerical       .25     .18     .33     .07     .46     .53     .61     .27     .46

The values given are those for high-school boys; correlations for girls are similar, but
generally lower. Correlations over .50 are in boldface type.
Source: Guide to the Use of GATB, 1958, p. L-1.

The GATB factors P, F, and M measure aptitudes not covered in the DAT
battery. DAT-Mechanical Reasoning has no counterpart in the GATB,
although it overlaps G and S to a considerable degree. DAT-Spelling and
Sentences overlap considerably with Verbal Reasoning.

10. For what types of guidance does the content of GATB seem more useful than
that of DAT? For what types is it less useful?

11. What do the correlations for DAT-Clerical tell about its meaning?

12. DAT and GATB spatial tests correlate .72, but each correlates only .50 with
PMA-Spatial. How do you account for this?

13. Make composition diagrams to show the overlap and unique content of these
pairs of tests:

a. DAT-NA and GATB-N
b. DAT-MR and GATB-S

14. Why does MR have a large spatial loading here, when a similar test showed
no such factor in Table 30?


SPATIAL ABILITY 

We cannot examine separately the psychological and practical significance
of every factor so far isolated, or even of all the scores in the test batteries
under discussion. We have selected spatial reasoning and mechanical
comprehension as examples for close attention. After reviewing evidence on
these two factors, we shall return to a general discussion of the test batteries.
Psychomotor abilities will be further considered in the next chapter.

Spatial ability was present in some early nonverbal tests of general ability,
but it was soon recognized that tests calling for comprehension of form
relationships were not measuring the same thing as tests like Picture
Arrangement which required comprehension of ideas. Early investigators of
vocational aptitudes identified a number of jobs which seemed to require facile
reasoning about forms, and spatial tests have since played a part in nearly
all research on vocational aptitude. The DAT manual speaks of Space
Relations in this way (Bennett et al., 1959, p. 7):

The Space Relations test is a measure of ability to deal with concrete
materials through visualization. There are many vocations in which one
is required to imagine how a specified object would appear if rotated in
a given way. This ability to manipulate things mentally, to create a
structure in one’s mind from a plan, is what the test is designed to
evaluate. It is an ability needed in such fields as drafting, dress designing,
architecture, art, die-making, and decorating, or wherever there is need
to visualize objects in three dimensions.

There appear to be several distinct spatial abilities. Comprehending static
objects (as in Block Counting) seems to involve something quite different
from visualizing how an object or machine will look after certain movements
take place (Guilford, 1947, pp. 269-296; Michael et al., 1951). A
visualization factor (Vz) is found in tests such as Binet paper-folding and in
some of Thurstone’s tests where the subject must visualize how a figure will
look when rotated.

Validity in Educational Prediction 

One might expect spatial ability to be relevant to high-school courses
such as geometry, shop, and engineering drawing. Validity coefficients for
many schools are available in the DAT manual, some of which are reported
in Table 38. For comparison, coefficients are also given for Numerical Ability
and Abstract Reasoning. The coefficients reported are based on boys, but
results for girls are similar.

Looking first at the correlations for geometry, we see that results from one
sample to another vary, sometimes mysteriously. The two White Plains
samples come from the same school in the same year, but coefficients in one
class are strikingly higher than in the other. If differences such as this occur
within one school in a well-defined course, it is obvious that generalizations
about validity are hazardous.

In all schools, SR has positive relations with geometry, but NA is a better
predictor. Insofar as we can judge from these coefficients, the contribution of
SR to prediction of geometry is accounted for by its general-factor content.
Other spatial tests show similar results. Though geometry undeniably
requires reasoning about forms, tested spatial ability accounts for little of the
variation in geometry marks. Here again we encounter evidence warning
the test user against trusting his judgment as to what a test is likely to
predict.

Note also, from this example, that the importance of a test cannot be
judged solely from its correlation with the criterion. Considered alone,
spatial ability has modest validity. Considered alongside other predictors, we
find that the predictive value of the test is due to its general-factor content.

TABLE 38. Some Validity Coefficients for Differential Aptitude Tests
Against Course Grades

                                               Time Between    Number    Correlation of Marks with
Course              Grade  Location            Test and Marks  of Cases  Space   Numerical  Abstract

Plane Geometry      10     St. Paul, Minn.     1 year          48        .32     .47        .24
                    10     White Plains, N.Y.  1 year          70        .20     .34        .19
                    10     White Plains, N.Y.  1 year          77        .53     .57        .56
Solid Geometry      12     Baltimore, Md.      1 year          47        .13     .33        .41
                    12     Hamilton, Ohio      1 semester      42        .18     .61        .25
Art                 8      Yonkers, N.Y.       1 year          471       .20     .23        .16
Mechanical Drawing  9      Worcester, Mass.    1 semester      44        .34     .41        .21
                    10     Gloucester, Mass.   1 year          46        .02     .17        .43
                    10     Independence, Mo.   1 year          44        .57     .49        .28
Shop                9      Worcester, Mass.    1 semester      142       .26     .27        .22
                    8      Yonkers, N.Y.       1 year          471       .18     .28        .14
                    10     Independence, Mo.   3 months        42        .07     .06        .41
                    8      Schenectady, N.Y.   1 semester      81        .33     .28        .50

Source: Bennett et al., 1959, pp. 42 ff.


The essential question about the practical value of a test is how much it
adds to what other measures can tell.
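
The question of how much a test adds can be made concrete with the squared
multiple correlation. The sketch below compares prediction of the St. Paul
plane-geometry marks from Numerical Ability alone with prediction from
Numerical Ability and Space Relations together. The two validity coefficients
are taken from Table 38; the intercorrelation of the two tests is not reported
in this section, so the value of .50 used here is an assumption chosen only for
illustration, and the function names are hypothetical.

```python
def r_squared_one(r_y1: float) -> float:
    """Proportion of criterion variance accounted for by one predictor."""
    return r_y1 ** 2


def r_squared_two(r_y1: float, r_y2: float, r_12: float) -> float:
    """Squared multiple correlation of a criterion with two predictors,
    computed from the three pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# St. Paul plane-geometry coefficients from Table 38.
r_marks_numerical = 0.47
r_marks_space = 0.32
# ASSUMPTION: the Space-Numerical intercorrelation is not given in the text;
# 0.50 is an illustrative value only.
r_space_numerical = 0.50

numerical_alone = r_squared_one(r_marks_numerical)
both_tests = r_squared_two(r_marks_numerical, r_marks_space, r_space_numerical)
gain = both_tests - numerical_alone

print(f"Numerical alone:      R^2 = {numerical_alone:.3f}")
print(f"Numerical plus Space: R^2 = {both_tests:.3f}")
print(f"Gain from adding Space:     {gain:.3f}")
```

Under the assumed intercorrelation, adding Space Relations raises the
explained variance by roughly one percentage point, which illustrates how a
test with a respectable correlation of its own may contribute little beyond
what another measure already tells.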

The remaining coefficients in Table 38 tell the same story: variation from
class to class, generally small positive correlations of SR with the criterion,
equally good correlations for nonspatial tests. These data, and data on other
tests, point to the conclusion that spatial ability does not, per se, predict
success in high-school courses.

A study of college mathematics grades was made by Hills (1957), using
Guilford’s experimental tests of new reasoning factors. He included two
separate spatial measures from the Guilford-Zimmerman Aptitude Survey.
One, Spatial Orientation (a measure of S), shows pictures across a boat’s
prow, as seen from the cockpit. The pictures are paired, and the task of the
subject is to locate in the second picture the “aiming point” toward which
the prow was pointed in the first scene. The second, Spatial Visualization
(Vz), requires the subject to identify how a clock will appear when tilted
and rotated in a sequence of movements described verbally. Hills found
consistent correlations of S with criteria in several mathematics courses for
engineers, coefficients being as high as .55. In courses for physics and
mathematics students at the same level of mathematics, however, S had
negligible validities. Hills also found that the relevance of the factor to a
specific course (e.g., calculus) depends on how the course is taught. Validities
for Vz were much smaller than for S in the engineering sections but were
consistently larger in sections for physics students. S gave a larger number
of substantial validity coefficients than any other of the reasoning factors
tested. Hills’ results hint that special abilities may be more valuable as
differential predictors in advanced courses than in high school. Special
abilities contribute little to prediction of overall grade averages, since no
ability save verbal or numerical affects many courses.

FIG. 49. Guilford-Zimmerman Spatial Orientation items. The subject is to mark whichever
answer shows the position of the boat's prow (represented by the bar) in relation to the original
aiming point (dot). The answers to the three items are C, B, and E respectively. (Copyright 1947,
Sheridan Supply Co., and reproduced by permission.)

15. Give possible explanations for the differences between the two White Plains 
samples in Table 38. 

16. How can one explain the negligible importance of spatial ability in predicting 
geometry? 

Occupational Validity 

The chief value of spatial tests is in vocational choice and employee
selection. A study of watch repairing, for example, indicates a marked
correspondence between spatial ability and performance, the validity
coefficient being .69 (Bennett et al., 1959, p. 63). Ghiselli’s summary of
published reports (1955) shows that spatial relations tests have predictive
validities averaging greater than .30 for either training success or job
proficiency in protective occupations, service occupations, mechanical
repairmen, electrical workers, structural workers, processing workers,
operators of complex machines, and gross manual workers.

A tremendous volume of information on vocational correlates of spatial
ability is provided by the USES. Table 39 gives a sample of their results,
along with data on General, Form Perception, and Manual Dexterity scores.
These data were gathered on persons working in or training for the
occupation who had already been selected to some extent, as is shown by the
fact that the mean score departs from 100 and the s.d. is below 20. Evidently
only persons very superior in space ability get into engineering and dentistry
courses. Drill-press operators, at the other end of the scale, are drawn from
the below-average workers who remain after those with better aptitude are
siphoned into other jobs. The validity coefficients for occupations where the
s.d. is low would be much larger if an unselected group had been hired.
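
The effect of prior selection on a validity coefficient can be estimated with
the standard correction for restriction of range. The sketch below is a
minimal illustration and not the USES procedure: it assumes that selection
operated directly on the spatial score, that the relation between score and
criterion is linear with equal scatter throughout the range, and that the
s.d. in an unselected group is 20. The draftsman figures (r = .32, s.d. = 12)
are read from Table 39; the function name is hypothetical.

```python
import math


def correct_for_range_restriction(r_restricted: float,
                                  sd_restricted: float,
                                  sd_unrestricted: float) -> float:
    """Estimate the correlation in an unselected group from the correlation
    observed in a group whose predictor s.d. has been reduced by selection.
    ASSUMPTION: selection was made directly on the predictor itself."""
    k = sd_unrestricted / sd_restricted
    rk = r_restricted * k
    return rk / math.sqrt(1.0 - r_restricted ** 2 + rk ** 2)


# Draftsman row of Table 39: validity .32 in a group with s.d. 12.
estimated = correct_for_range_restriction(0.32, sd_restricted=12.0,
                                          sd_unrestricted=20.0)
print(f"Observed r = .32; estimated r in an unselected group = {estimated:.2f}")
```

When the two standard deviations are equal the function returns the observed
coefficient unchanged, and the estimate grows as the selected group becomes
more homogeneous.
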
Spatial ability is also important in several of these occupations. Both
general and spatial ability contribute to success as draftsman or
tabulating-machine operator; dentists, engineers, and machinists need form
perception in addition. Careful distinction between aptitudes is important for
job assignments. Although S and P are both, in a sense, “spatial,” S is
important in dentistry lecture courses while P is not. For bomb-fuse
assemblers, the quick perception tested by P is much more important than the
reasoning tested in S. The radio-tube mounter is likewise engaged in assembly
of small parts, but his success depends on dexterity, not on S or P.

Few of the correlations in Table 39 are large. Spatial ability alone does not

TABLE 39. Validity of GATB-S Against Occupational Criteria

                             Number                        Spatial Aptitude     Comparable Correlations for
Occupation                   of Cases  Criterion           Mean   s.d.   r      G      P      M

Dentist                      96        Lecture grades      132    14     .29    .24    -.02   -.18
                             89        Laboratory grades                 .33    .13    .33    .14
Engineer                     150       School grades       134    15     .11    .42    .11
Draftsman                    40        Ratings             126    12     .32    .42    .06    .24
Machinist                    71        Ratings             114    18     .37    .29    .27    .08
Tabulating-machine operator  203       Ratings             106    18     .20    .34    .10    .10
Bomb-fuse parts assembler    90        Ratings             102    15     .12    .21    .33    .31
Mounter (radio tubes)        100       Production records  101    14     -.02   .03    -.02   .54
Upholsterer                  49        Ratings             97     17     .43    .24    .25    .32
Poultry laborer              72        Ratings             95     16     .03    .24    .09    .56
Drill-press operator         31        Production records  83     18     .05    .32    .22    .47

Values in boldface are significant (P < .05).
Source: Guide to the Use of GATB, 1958, III.


account for success in any of these jobs. Taking all the aptitudes into account
simultaneously, however, can greatly improve employment decisions. In
Chapter 12 we shall explain some of the procedures used to combine
aptitudes into a selection formula.

17. Explain why the validities of the GATB tests are different for the two dentistry
criteria.

MECHANICAL COMPREHENSION 

We have previously discussed the Bennett TMC, which is the prototype for
the Mechanical Reasoning Test of the DAT battery. No mechanical
comprehension test was included in the USES battery, on the assumption that
other tests in the battery cover much of what such a test would measure. As
Table 37 showed, DAT-MR correlates about .60 with tests of G, V, and S.
Factor analyses of an Air Force test patterned after the TMC, however,
indicate that about 35 percent of its variance comes from Mechanical
Experience, 25 percent from Visualization, and only 12 percent from G, V,
and S combined (Guilford, 1947, pp. 336-339). These reports are less
contradictory than they perhaps appear, since each analysis is based on
different test batteries and statistical procedures. The Air Force analysis is
the more satisfactory, being based on far more data and reporting
correlations with factors rather than with single tests.

The validity coefficients for mechanical comprehension against high-school
marks run a bit lower than coefficients for other abilities. In the DAT,
the median correlation of MR with science grades for boys is .40, compared
to VR, .54; NA, .52; AR, .42; and SR, .34. (See also Table 11.) Quite similar
results are obtained for the Multiple Aptitude Tests, another high-school
battery.

Adaptations of the Bennett test have frequently predicted success in
civilian and military technical specialties (Bennett and Fear, 1943). The
British Army found that a form of the Bennett test had a validity of .59 for
selecting truck drivers; no other test was nearly so good (Vernon and Parry,
1949, p. 230). Among the average validity coefficients calculated by Ghiselli
(1955) on the basis of the published literature, mechanical comprehension
had the following notable validities for either training or job-performance
criteria:

.50 to .59: machining workers, bench workers, and assemblers
.40 to .49: protective occupations, electrical workers, processing workers,
complex-machine operators, inspectors
.30 to .39: mechanical repairmen, welders, vehicle operators, structural
workers

Of particular interest is an Air Force study in which a factor analysis of
pilot success was made. Out of 26 independent factors considered, the two
most significant for pilot success were Spatial and Mechanical Experience
(followed closely by Integration, Visualization, Psychomotor Coordination,
and Pilot Interests; Guilford, 1947, p. 843). The Mechanical Principles test
had a validity of about .35 as a predictor of pilot success.

A second type of mechanical test requires subjects to identify pictures of
tools and is thus a measure of acquaintance rather than understanding. A
recent test of this type is illustrated in Figure 50. The subject is to find the
lettered picture that goes with each numbered picture. Knowledge about a
field may be regarded as an indication of interest in it, if people have more
or less equal opportunities to get such information. Verbal tests of
information about machinery, medicine, current events, sports, etc., may be
useful in vocational prediction.

FIG. 50. Part of the Mellenbruch test. (Copyright 1957 by Psychometric Affiliates and
reproduced by permission.)

The U.S. Employment Service has done much to develop trade tests for
use where an applicant claims to know a particular job. These questions
about the job in effect constitute a short interview. Many men who claim
experience in a trade fail on the questions. Such a screening test, used in an
employment center, eliminates those who might otherwise be shipped
across the country to a plant where skilled men are needed. Trade tests are
also used in military classification to check whether men are qualified in the
trades where they claim civilian experience. In the British Army such tests,
because of their reliability, were sometimes more dependable bases for
assigning men than records made in training courses (Vernon and Parry,
1949, p. 244).

Trade questions are selected to cover job processes and tools. Questions
that would be unfair because of regional differences in methods of work or
vocabulary are eliminated. To check item validity, three criterion groups
are tested: expert workers, beginners in the trade, and workers in closely
related trades. The items which discriminate these groups are retained.
Items from several tests are (Stead et al., 1940):

(Carpenter) What do you mean by a “shore” in carpentry? Ans.: Upright brace.

(Plumber) What are the two most commonly used methods of testing plumbing
systems? Ans.: Water, smoke, peppermint, air (any two).

(Asbestos worker) In stitching canvas covering over pipes, where is the seam
run? Ans.: Out of sight, back or top of pipe (either).

A good trade test discriminates between novices, apprentices, journeymen,
and experts. In Figure 51 we see how a test of engine-lathe operators
functions. Such a distribution of scores permits one to classify a job applicant
with little error; a score of 22 almost certainly indicates a journeyman.

FIG. 51. Scores on a trade test for engine lathe operators (Burtt, 1942,
p. 493).


18. a. The Purdue Assembly test is designed to include mechanisms using each
important mechanical device: gears, levers, rack-and-pinion, etc. Does such a
test assume that mechanical aptitude or comprehension is a single general
ability, or that it is a group of specific abilities?
b. If the latter theory is true, what implications does it have for selecting
students for training in watch repairing?

19. Boys surpass girls on the Bennett test. How may this finding be explained?

20. Mellenbruch reports validity coefficients ranging from .50 to .60 for his
mechanical aptitude test. The criteria used are teacher's ranking of engineering
drawing trainees (women), experience in mechanical activities, and scores on
the Air Force Mechanical Information Test. What other validity studies are
needed to support his recommendation that those scoring low on the test should
not be hired for mechanical work or should be placed only in routine
mechanical jobs?


THE INTERPRETATION OF APTITUDE PROFILES 

Differential ability tests are used in two ways: for institutional decisions and
for individual decisions (Cronbach and Gleser, 1957). An institutional
decision is one in which a factory, a school, a military organization, or the like
selects and assigns individuals in order to obtain the best total result, i.e., the
greatest possible attainment of institutional goals. This use of the tests rests
primarily on efficient statistical combination of scores rather than on
psychological interpretation. An individual decision is one which seeks to
promote the welfare of one person, considered by himself. In career guidance,
for example, the emphasis must be on psychological interpretation. We shall
concern ourselves here with the use of profile information in individual
decisions, and turn to institutional decisions in Chapter 12.

Perhaps there was once a hope among counselors that a test profile
would permit a definite, final choice of vocation at the time the tests are
given. If this were the case, the counselor and client together could reach a
decision, and the client could rely on the counselor’s interpretation of the
tests. Today it is recognized that the client himself must fully understand
the test results, for two reasons.

One reason is that vocational choice is not a single final throw of the dice.
As a person goes through school and into his first jobs, he has many occasions
to narrow his field of concentration or even to transfer to a new area.
High-school courses and introductory college courses provide opportunities for
him to explore and develop aptitudes and interests. In an expanding
economy, workers change position or change responsibilities within the same
establishment. The engineer in a technical firm, for example, may become a
manager, a salesman, a creative designer, or an expert on detailed
specifications. Wise choice requires self-understanding; no “prescription” filled
out by a tenth-grade or freshman-year counselor can anticipate these
subsequent decisions. Test interpretation is only one step in a long process of
self-discovery.

Secondly, the client is more likely to accept recommendations which he
understands. The counselor may be convinced that a freshman should get
out of engineering and into advertising. Even though advertising is
consistent with the boy’s talents and interests, he may resist or ignore the
recommendation. If he has been visualizing himself as an engineer for years,
such a change of program requires him to alter his entire self-concept and may
seem like an admission of defeat. To accept the new goal requires that he
understand the facts the counselor considers significant. Acquiring a new
self-image requires both factual and emotional learning.

The counselor must decide what meaning may justifiably be extracted
from scores and must at the same time consider how this information is to be
communicated so that it affects the client’s conduct.


Limitations on interpretation 

A general ability test or a battery of aptitude measures has definite
predictive value, as we have seen. At the same time, the scores have distinct
limitations which must be remembered.

Profile Shape as a Function of the Norm Group. It is necessary to use test
norms in order to plot a profile, and the choice of norms determines the
profile shape. Profile shape changes when a different norm group is used. The
most common example arises in interpreting mechanical comprehension
scores for girls (see p. 92).

The USES profile is ordinarily plotted against norms for adult workers.
The profile (Figure 52) of a (hypothetical) student engineer plotted in the
usual manner (upper profile, Figure 52) draws attention to his superior
G, V, and S abilities, and shows him near the average in dexterity. If
instead we use a standard-score conversion from data on engineering students,
his profile (lower profile, Figure 52) takes on a strikingly different
appearance. His greatest strength, relative to other engineers, is V. In S, he is
just above average; he is average in G, and behind the group in dexterity.

FIG. 52. Two GATB profiles for the same student engineer.

It is important to compare the person with the group he will associate
with and compete against rather than with “people-in-general.”
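
The dependence of profile shape on the norm group can be shown with a few
lines of arithmetic. In the sketch below every number is hypothetical: the
student's scores are expressed on the usual scale with mean 100 and s.d. 20
for workers in general, and the means and standard deviations for the
engineering-student norm group are invented for illustration (they are not
published norms). The same four scores are converted to standard-deviation
units against each group.

```python
# Scores of one hypothetical student on the scale where workers in general
# have mean 100 and s.d. 20.
student = {"G": 130, "V": 128, "S": 137, "M": 100}

GENERAL_MEAN, GENERAL_SD = 100.0, 20.0

# ASSUMPTION: illustrative (mean, s.d.) pairs for an engineering-student
# norm group; these values are made up for the example.
engineer_norms = {
    "G": (130.0, 12.0),
    "V": (118.0, 14.0),
    "S": (134.0, 15.0),
    "M": (112.0, 16.0),
}


def z_score(score: float, mean: float, sd: float) -> float:
    """Express a score in standard-deviation units of the chosen norm group."""
    return (score - mean) / sd


vs_general = {name: z_score(score, GENERAL_MEAN, GENERAL_SD)
              for name, score in student.items()}
vs_engineers = {name: z_score(score, *engineer_norms[name])
                for name, score in student.items()}

for name in student:
    print(f"{name}: {vs_general[name]:+.2f} against workers in general, "
          f"{vs_engineers[name]:+.2f} against engineering students")
```

Against workers in general the student stands high on G, V, and S and at the
average on M; against the invented engineering norms, V becomes his relative
strength, G falls at the group average, and M falls below it.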

Precision of Measurement. When we try to measure several aptitudes in a
short period of time, reliability coefficients often drop to .75 or .80. Even
with high reliability, a retest would show enough change to suggest different
recommendations for a certain number of persons. These random errors,
though present, do not cause much concern when tests are used for
institutional decisions. Even if a test is seriously wrong in 10 percent of the
cases, the decision maker reaches correct conclusions far more often than he
could with other data. If an unintelligent man slips by an Army screening test,
he can be detected later and discharged at no enormous cost. In an individual
decision, however, we cannot be content with a small rate of error. One
error may alter a person’s entire life if the test leads him to decide, for
example, not to continue his education.

Suppose it is known that 70 people out of 100 having IQ 110 fail in a
certain profession. The counselor cannot make a clear prediction for Walter,
IQ 110. Perhaps he would do better if tested again. Perhaps other qualities
unknown to us make Walter one of the 30 who would succeed, rather than
of the 70 who fail. Almost never are psychological tests so valid that a
prediction about a single case is certainly true.

The counselor who is conscious of unreliability adopts many precautions to
reduce its ill effects. He checks each test result against the case history for
consistency. If in doubt, he confirms significant test findings by a second
comparable test. He examines his case for special factors such as language
difficulty which might make the test invalid. Most important, he thinks of a
test performance as placing the subject in a probable range of scores, rather
than as pegging him firmly at a particular percentile. Tests rarely misfire in
stating that a student is “somewhat, but not extremely, below average in
scholastic aptitude.” The statement “Walter is at the 32nd percentile of
college freshmen” is almost certainly untrue, in the sense that further data
would not precisely confirm it.
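
The idea of a probable range can be illustrated by converting an obtained
percentile into a band. The sketch below is only an illustration under stated
assumptions: scores are taken to be normally distributed, the reliability of
.80 is a hypothetical value (the text gives none for Walter's test), and the
band is drawn one standard error of measurement on either side of the obtained
score. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def percentile_band(percentile: float, reliability: float,
                    n_sem: float = 1.0) -> tuple[float, float]:
    """Return (low, high) percentile limits for a band of +/- n_sem standard
    errors of measurement around an obtained percentile.
    ASSUMPTION: scores are normally distributed in the norm group."""
    unit_normal = NormalDist()             # scores expressed in s.d. units
    z = unit_normal.inv_cdf(percentile)    # obtained score as a z score
    sem = math.sqrt(1.0 - reliability)     # SEM in s.d. units
    low = unit_normal.cdf(z - n_sem * sem)
    high = unit_normal.cdf(z + n_sem * sem)
    return low, high


# Walter's obtained standing, with an assumed reliability of .80.
low, high = percentile_band(0.32, reliability=0.80)
print(f"Obtained percentile 32; band runs from about {low:.0%} to {high:.0%}")
```

A band this wide is consistent with the statement "somewhat, but not
extremely, below average" while giving no support to a report of exactly the
32nd percentile.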

Clients and such professional workers as teachers, physicians, and social
workers may place false reliance on test data which they regard as
“scientific.” Even when the tester’s report is carefully qualified, the person
receiving the report is likely to remember only portions of it. A parent, learning
that his child’s IQ is 87, may forget the tester’s cautions about what the test
does not measure, the possibility of growth or decline in IQ, and the
approximate nature of predictions from it. Instead, the figure itself may stick
vividly in mind and be used as a basis for significant decisions for years to
come.

Profile Reliability. Special difficulties are encountered in interpreting
multiscore tests where judgments rest on differences between scores. Such
differences are usually much less reliable than the scores themselves. DAT-VR
and DAT-NA are reliable, for example, but they overlap, and much of the
reliability of each score is due to the overlapping part. When that is
subtracted, the remaining test variance contains a high proportion of error
(Thorndike and Hagen, 1955, p. 178).

The reliability of a difference between two standard scores A and B is
calculated by this formula:


$$ r_{(A-B)(A'-B')} = \frac{r_{AA'} + r_{BB'} - 2r_{AB}}{2 - 2r_{AB}} $$
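
The formula is easily evaluated for any pair of tests. The sketch below is a
minimal illustration in which the two reliabilities (.90 each) and the
intercorrelations are hypothetical values, not figures from the DAT manual;
the function name is likewise an assumption. It shows how the reliability of
the difference falls as the overlap between the two tests rises.

```python
def difference_reliability(r_aa: float, r_bb: float, r_ab: float) -> float:
    """Reliability of the difference between two standard scores A and B,
    given their reliabilities (r_aa, r_bb) and their intercorrelation (r_ab)."""
    if r_ab >= 1.0:
        raise ValueError("intercorrelation must be below 1")
    return (r_aa + r_bb - 2.0 * r_ab) / (2.0 - 2.0 * r_ab)


# Hypothetical tests, each with reliability .90, at three degrees of overlap.
for r_ab in (0.00, 0.60, 0.80):
    rel = difference_reliability(0.90, 0.90, r_ab)
    print(f"intercorrelation {r_ab:.2f}: reliability of difference {rel:.2f}")
```

With no overlap the difference is as reliable as the scores themselves; with
an intercorrelation of .80 only half of the variance in the difference is
dependable.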


When tests have low reliabilities or a high degree of overlap, the difference
is highly unreliable. Using the data of Table 34, we find that in the DAT the
reliability of the VR-NA difference is .76. That of VR minus CSA is .82.

Small differences are generally chance effects. When a difference
becomes twice as large as its standard error, there is only one chance in
twenty that the person is equally good on both tests. We can have substantial
confidence that a retest would confirm such a difference. Table 40 indicates


TABLE 40. Interpretability of Difference Scores

Average          Difference in        Proportion of Subjects (Percent) Showing
Reliability of   T-Score Units        Interpretable Difference if Test
Profile Scores   Required for         Intercorrelation Is
                 Interpretation       .00      .25      .50      .75

.95              6.3                  66       61       53       38
.90              8.8                  54       47       38       22
.80              12.5                 37       31       21        8
.70              15.3                 28       21       13        3

In this table, an interpretable difference is defined as one which would occur only
one time in twenty, in testing persons whose two abilities are actually equal.

how large a difference must be to allow this degree of confidence. For
DAT-VR and -NA, the average reliability is near .90. The table tells us that
a difference between these scores must be at least 9 points to be significant.
A difference smaller than that indicated by the table should be regarded only
as a suggestion, to be confirmed by other data. If two tests are highly
correlated, few difference scores will be large enough for interpretation. The
test profile then is not very useful for differential measurement.
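
The entries of Table 40 can be approximated from first principles. The sketch
below rests on assumptions that are not stated in the text: both tests have
the same reliability, scores are T scores (s.d. 10) with normally distributed
differences, and "one time in twenty" is taken as 1.96 standard errors of the
difference. Under these assumptions the results agree with the tabled values
to within rounding, but the table may have been computed differently. The
function names are hypothetical.

```python
import math
from statistics import NormalDist

T_SD = 10.0         # s.d. of T scores
CRITICAL_Z = 1.96   # two-tailed 5 percent level (assumed multiplier)


def minimum_interpretable_difference(reliability: float) -> float:
    """Smallest T-score difference that errors of measurement alone would
    produce only one time in twenty, for two tests of equal reliability."""
    se_difference = T_SD * math.sqrt(2.0 * (1.0 - reliability))
    return CRITICAL_Z * se_difference


def proportion_interpretable(reliability: float,
                             intercorrelation: float) -> float:
    """Proportion of persons whose observed difference exceeds the minimum,
    assuming normally distributed difference scores."""
    sd_of_differences = T_SD * math.sqrt(2.0 * (1.0 - intercorrelation))
    z = minimum_interpretable_difference(reliability) / sd_of_differences
    return 2.0 * (1.0 - NormalDist().cdf(z))


for reliability in (0.95, 0.90, 0.80, 0.70):
    needed = minimum_interpretable_difference(reliability)
    shares = [proportion_interpretable(reliability, r)
              for r in (0.00, 0.25, 0.50, 0.75)]
    row = "  ".join(f"{share:.0%}" for share in shares)
    print(f"reliability {reliability:.2f}: need {needed:4.1f} points; {row}")
```

The calculation makes the two influences visible separately: lower reliability
raises the difference required, and higher intercorrelation shrinks the spread
of observed differences, so that fewer persons reach the required size.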

Test developers are giving increasing thought to ways of reporting scores
so that their unreliability will be kept in mind. One device is the report form
for an educational achievement test shown in Figure 53. Here the pupil’s
score is shown, not as a point on the scale, but as a range within which his
ability almost certainly falls. The width of the band is twice the standard
error of measurement. The student can see from the profile that about four
persons in ten surpass his mathematics score, and that the difference
between social studies and mathematics is not reliable.

FIG. 53. Profile for the Sequential Tests of Educational Progress. The shaded
areas for Mathematics and Social Studies overlap; there is no important
difference in standings on these two tests. The same is true of Mathematics and
Science. However, the shaded areas for Science and Social Studies do not overlap.
The student is higher in Social Studies than in Science ability, as measured by
these tests. (Copyright 1958, Cooperative Test Division, Educational Testing
Service, and reproduced by permission.)

21. Calculate the reliability of a difference between Spelling and Sentences in
DAT.

22. Which pair of GATB tests appears to have the least reliable difference?
Compute its reliability.

23. Examine the DAT profile in Figure 16. Which score differences are reliable
enough to interpret?

24. In the PMA tests for ages 5 to 7, the correlation of V with S is .60. The
reliabilities are .77 and .86, respectively. What can you say about the
interpretability of the V-S difference?

Stability of Aptitudes. Vocational guidance involves an attempt to predict
success far into the future. This prediction cannot be made unless the
aptitude pattern is stable over a long period of time. Measures of general
ability have substantial stability after about the age of 9, when the initial
adjustment to schooling is completed. But how early does the pattern of
specialized aptitudes emerge?

The DAT is designed for use as early as Grade 8. One study of its stability
over time, based on tests in Grade 9 and retests in Grade 12, gives
coefficients of stability for boys above .80 for Verbal and Spelling; near .70 for
Sentences, Numerical, and Mechanical; and near .60 for Space, Abstract,
and Clerical (Bennett et al., 1959, p. 68). Much of this stability is no doubt
due to the general factor. Similar results were reported for GATB
(Table 36).

The real question is the stability of differences within the profile. When
ninth-grade differences were correlated with twelfth-grade differences, the
correlations ranged from .20 (Numerical minus Abstract) to .74 (Mechanical
minus Spelling) (Doppelt and Bennett, 1951). Differences among Clerical,
Mechanical, and the overall level of the verbal-language-numerical tests are
stable enough to be taken seriously. It is doubtful if long-range predictions
can be based on Space scores in Grade 9, or on differences between Verbal,
Numerical, and Abstract scores in that grade.

In view of the inadequacy of present data on the stability of factor scores,
a firm conclusion cannot be reached. The following statement is a “best
guess” as to what more complete research will show. Special ability tests
may have some use for short-term prediction and classification in
elementary school. This is suggested by Reed’s finding (1958) that the PMA spatial
score (visual discrimination) correlates .41 with achievement in primary
reading whereas verbal ability correlates only .27. At higher grades, V
correlates .52 and S only .18, a finding which reflects the shift in teaching
emphasis, after basic skills are established, from perception to comprehension.
Special ability tests are often relevant in studying children requiring
remedial help. For most elementary pupils, guidance is best based on general
measures of verbal and nonverbal ability rather than on more elaborate
profiles whose implications for instruction are unknown.

There can be little long-range differential prediction before Grade 11. In
Grades 7-10, aptitude tests suggest strong points so that the pupil will be
encouraged to enroll in courses where these assets will be developed. In these
grades low scores need not be considered seriously save where, as in Number
and Spelling, remedial instruction can raise the score. By mid-adolescence,
the individual’s aptitude pattern is reasonably stable. Even at this
age, irreversible decisions should be avoided. Later courses and job
experience will add greatly to the student’s knowledge of his aptitudes.

Meanings Attributed to Scores. For counseling, scores must be explained in
common-sense terms. The client will continue to face choices between
courses and between job openings, and the counselor cannot possibly give a
recommendation that will anticipate all such questions. He must help the
client to understand his own profile and to understand what tasks the various
aptitudes are relevant to.

The DAT and GATB profiles are well designed for such interpretations.
Labels like Numerical Reasoning and Spelling do not sound like mysterious
inborn aptitudes; they are clearly measures of a certain type of performance.
The safest way to interpret scores is in terms of the items that constitute the
test; i.e., “This score shows that you do well on problems like this.” Any
more elaborate interpretation leads quickly to misunderstanding. Mechanical
reasoning is misinterpreted as “mechanical aptitude” though the test
clearly does not cover dexterity. The clerical test is misinterpreted as a
predictor of success in stenography and typing whereas it actually covers rapid
checking of details, important only in very routine office jobs. The student
may connect spatial ability to art, geometry, and shop courses even though
the validity coefficients discourage such an interpretation.

Some degree of vagueness is absolutely essential. The student should be
made to feel that he can improve many of his aptitudes. He should regard
the test findings as hints to be checked in other experience. Nothing in our
experience with testing justifies making firm individual decisions on the basis
of differential abilities.

The case of Sarah Carrell provides an illustration of many of the comments
we have made (Bennett et al., 1951).

Early in her junior year, Sarah talked over her test scores with the counselor.
Her school work had been satisfactory. She then appealed for help in persuading
her mother that it was worth while to finish high school. The mother wished her
to go to work since her father had been forced to retire on just a small pension.
The mother felt Sarah was over-age (illness in childhood had retarded her one
year), and that she would not do well in secretarial training because her school
grades were not above average. Moreover, none of Sarah’s older sisters had
graduated from high school and the mother considered high school of little value
for a girl.

Sarah’s DAT profile showed that she fell in the middle range of high-school
juniors. Her Spelling and Sentences scores were her lowest, at the
25th percentile. Her peaks were Numerical (75th percentile) and Abstract
(70). All other scores were at the median. In Grade 9, a reading test had
placed her at the 58th percentile, and the Otis group mental test in Grade 8
at the 47th percentile. All these agree with the DAT in indicating that Sarah
had enough ability to finish school.

The test record was useful in showing Sarah’s mother that the girl was
superior in numerical and abstract performances. The counselor pointed out
that Sarah could expect to do well in calculating and bookkeeping, which she
could take if she stayed in school. (NA, AR, and Sentences are the best
predictors of bookkeeping marks.) “The mother,” says the counselor, “then
admitted that her secret desire had been for Sarah to work in an insurance
office where her brother-in-law could secure her a job. She conceded that if
Sarah was that good, she ought to have a chance to finish school.”

Sarah was deficient in language usage, and the counselor should point to
its importance in office work. If this deficiency is repaired, as may well
happen when study is motivated by a definite goal, Sarah could qualify for
almost any office job at a modest level of responsibility. If this deficiency
remains, the test analysis has shown that her best opportunity for success is
in bookkeeping or the like.

The DAT scores of Robert Finchley (Figure 16) contradict his scores on
other tests. His Otis score was at the 55th percentile, his reading speed at
the 24th, and his comprehension at the 50th. But on the DAT, he had these
percentile scores (Bennett et al., 1951):

VR    NR    AR    SR    MR    CSA   Sentences   Spelling
95    95    97    92    95    10    14          9

His parents are college graduates, and his sister has a good school record.
Robert’s record had declined steadily during all his school years, and in high
school he was doing little of his assigned work. The DAT had been given
routinely in Grade 10, but no effort was made to discuss it with Robert, or
even with his teachers, until a year later.

The story of the test scores is clear: outstanding overall ability, with a
severe deficiency in clerical speed and in language usage. From the case
history it appears that Robert’s teachers had begun to regard him as a mediocre
student who could not be expected to do well, and that he had come to share
their opinion. Robert himself was openly delighted with the test report
and put forth more effort as he regained confidence. He became interested
in obtaining information on schools of engineering. The record suggests a
need for remedial reading, but this could perhaps better be added to
Robert’s schedule after he gets his current work in hand.

25. Would the GATB have given valuable information to supplement Sarah’s DAT
scores?

26. Why did the Otis test fail to reveal Robert's superiority? 

27. Is engineering the most suitable goal for Robert?

28. Interpret the profile of Ellsworth Newcomb. He has been preparing for
engineering, but is making C's in mathematics. His tested interests are in
verbal and personal-contact activities. He has done some selling, with success.
On the OSU test, he scores at the 69th percentile of college freshmen. His DAT
percentile scores in Grade 12 are:

Verbal  Numerical  Abstract  Space  Mechanical  Clerical  Spelling  Sentences
86      48         44        40     36          13        73        93

IMPORTANT GUIDANCE BATTERIES 

There is more to be said about interpreting tests so as to maximize the
subject’s insight, but before expanding on this point we provide a summary
listing of some differential batteries now available to the tester.
Comprehensive information on these batteries has been compiled and reviewed by
Super (1958).

• Differential Aptitude Tests, George K. Bennett, Harold G. Seashore,
Alexander G. Wesman, Psychological Corporation, 1947, 1952. Grade 8 to
college. A well-constructed set of eight tests described on pp. 269 ff.

• Flanagan Aptitude Classification Tests, John C. Flanagan, Science
Research Associates, 1953. High-school seniors. A seven-hour battery with 21
tests suggested by Air Force factor analyses. In addition to the customary
aptitudes there are tests for ingenuity, tapping, speed of scale reading,
carving skill, etc. The validity of the tests is still under investigation, and
they should be restricted to research use at present. In particular, the
“occupational scores” obtained by combining tests should not be used until
satisfactory evidence of their validity is provided.

• Guilford-Zimmerman Aptitude Survey, J. P. Guilford and Wayne S.
Zimmerman, Sheridan Supply Company, 1947. Measures V, R, N, P, S,
Vz, Mechanical knowledge. Based on factors found useful in Air Force
classification. Contains several unique tests which may have predictive
value, but evidence on predictive validity in civilian tasks is not available.
Primarily for research use at present.

• Holzinger-Crowder Uni-Factor Tests, Karl J. Holzinger and Norman A.
Crowder, World Book, 1955. Grades 7-12. An excellently constructed test
which in one hour gives measures of V, S, N, R, and a composite measure
of scholastic aptitude. Reliabilities range from .80 to .90, with unusually low
intercorrelations. Speed is of some importance in the short subtests. Predicts
overall grade average very well, but the value of the factor scores for
differential prediction appears to be quite limited.

• Multiple Aptitude Tests, David Segel and Evelyn Raskin, California
Test Bureau, 1955. Grades 7-12. Nine tests in three hours cover vocabulary,
reading, language usage, clerical, arithmetic, mechanical comprehension,
and spatial abilities. Test scores may be combined into substantially
intercorrelated scores for V, P, N, and S. The battery is technically
satisfactory and uses tests of familiar types which can be interpreted by
experienced counselors. Differential validity for course grades is not great,
and occupational validities are not established.

• Tests of Primary Mental Abilities, L. L. and T. G. Thurstone, Science
Research Associates, 1941, 1953. Ages 5-7, 7-11, 11-17. The battery measures
several of Thurstone’s factors, the list differing at each level. The tests at
lower grades are best interpreted as measuring general ability by verbal
and nonverbal tests. Evidence is inadequate to support diagnostic
interpretation or differential prediction. The one-hour battery for ages 11-17
measures vocabulary, computation, fluency, space, and reasoning. Until evidence
to indicate the meaning of profiles is available, the tests should be confined
to research use. Incautious and incorrect claims have been made for the
PMA tests (Anastasi, 1954, pp. 114, 365-368; Super, 1958, p. 87).

HELPING CLIENTS USE TEST INFORMATION

Client-Centered Counseling 

In earlier days of psychological service, the counselor was often viewed
as an expert passing judgment, in the same category as an engineer inspecting
a bridge or a physician prescribing for a disease. The modern view is
that the counselor does not decide or direct, but rather helps the client think
for himself. In extremely directive or prescriptive counseling, the “expert”
obtains facts, decides, and tells the client what to do. So-called
“client-centered” counseling stresses the importance of the client’s making his
own decisions. This point of view, formulated by Rogers (1942), emphasizes that
the important goal is the growth of the client toward maturity and adjustment.
A person who has learned to rely on his own judgment has been
helped more than one who must seek advice in each new crisis.

Expert advice often fails because factual questions are entangled in
emotional attitudes. The true problem is often not the surface problem voiced to
the counselor. Suppose Stan Howard, employed on a finishing machine,
comes to inquire why he was not promoted to foreman. The directive
personnel manager might give the facts, based on tests and ratings, which
“prove” that he would make a poor foreman. He may even give Howard a
pep talk on how well he produces, about the chance of raising his pay as a
workman, and about the undesirability of seeking a job where he would fail.
Howard is likely to nod his head and leave, but he may be far from
convinced he should not be a foreman; he may quit and go to another company
where he’ll “have a chance.” Howard may have failed to state or even to
recognize that he is anxious to be a foreman because his brother-in-law is a
foreman and he wishes equal status. Similar “irrelevant, nonobjective”
factors may lurk within the case of the student who studies inadequately,
the airman who longs to be a pilot, the mother who overrates her child’s
ability, or the unpopular girl. The client seeking counseling phrases his
problem to protect the tender spots of his ego. The counselor who relieves a
surface problem may be helping the client to avoid facing his real conflicts.

The nondirective methods suggested by Rogers help the client express his
feelings. The counselor reflects the client’s feelings by rephrasing what the
client has said: “You think you’d rather be a foreman than a machine
operator”; “It’s discouraging when a man who came after you did is promoted
over you”; “You feel that the management doesn’t trust you.” Acknowledging
the feelings, instead of trying to prove them false, promotes ultimate
adjustment. The client, freed from need to justify or apologize for his
attitudes, gains insight into himself.

The client is made responsible. He asks the questions, limits the area
discussed, makes the judgments, and decides when to terminate the counseling.
If the counselor proposes a test, or suggests that poor arithmetic may be a
source of difficulty, or lays down alternative solutions, he is taking
responsibility. He thereby risks pushing the client faster than he is ready to
move. Tests designed to help the tester be wise become of secondary importance
in client-centered counseling because they do not center on the feelings of
the client. Says Rogers (1946):

The counseling process is furthered if the counselor drops all effort to
evaluate and diagnose and concentrates solely on creating the psychological
setting in which the client feels he is deeply understood and free
to be himself. It is unimportant that the counselor know about the
client. It is highly important that the client be able to learn himself.
(Not to learn about himself, but to learn and accept his own self.) In
making use of these principles the counselor examines his own attitudes
and techniques and endeavors to refine his procedures so as to eliminate
all which are not in accord with the basic principles. Thus questions
are eliminated from the interview because they invariably direct
the conversation, advice is eliminated because it assumes the counselor
to be the responsible person, diagnosis and evaluation are put aside
because it has been learned that even when they are not voiced they tend
to distort the counselor’s responses in subtle ways and to break down
his full acceptance of client attitudes.

Tests are not abandoned if one accepts this outlook. Instead of being
ways of learning answers about the client so we can tell him, tests become
ways of helping the client find out about himself. In strictly nondirective
counseling, tests enter only when the client asks for them. The student who
comes with the statement “I’m worried because it takes me so long to learn
an assignment” is not immediately seated before a battery of tests. Instead,
counseling may go through to completion with no use of tests. Perhaps, in
the course of examining his difficulties, he says, “I’ve often worried about
whether I’m as bright as the students I compete with. I thought you people
had some tests that would tell about that.” Then the counselor supplies him
with the means of measuring himself, since he has apparently reached the
maturity required to face his question honestly.

Most counselors compromise with the strictly nondirective approach, but
its basic idea can be of great assistance. It is the repeated experience of
counselors trying this technique that when tests are delayed, problems come to
light which would otherwise never have been voiced. A student may request
an intelligence test. Given the test, told that his score is normal, and
dismissed, he has been reassured but not necessarily helped. Taking the test
may have reduced his tension temporarily but left untouched the basic
conflict that set him to wondering about his intelligence. Perhaps he is
worrying about changing his major; perhaps he is concerned because his grades in
college are lower than in high school. The problem may be as remote from that
stated as a worry because his wife’s family considers his pronunciation
peculiar. The counselor who avoids bringing the conference to a head, giving
an answer, and terminating the interview permits the client to dig into what
really concerns him.

Bordin and Bixler (1946) suggest that counselors place on the client the
responsibility of choosing the tests to be taken. In contrast to establishing a
standard battery of tests to be taken, they invite questions about tests and
discuss at length the sorts of tests available. They neither recommend a
particular test nor limit their description to the tests the client asks about.
After hearing what tests can be had, the client takes the initiative in deciding
among them. This is particularly helpful in erasing the idea, common among
those who seek counseling, that one or two tests will give definite answers
to every problem.

Decisions made by the counselor apparently have less effect on most
clients than those they make themselves. The counselor helps the client most
when he helps him to reason out his own decision. Bixler and Bixler (1946)
have made numerous suggestions to increase the client’s involvement in test
interpretation and his self-examination.

The counselor avoids giving opinions. The counselor is always tempted to
comment on the goodness or badness of scores to build confidence or
emphasize the seriousness of symptoms. Such evaluation comes between the
client and the score and makes it harder for him to accept the score as a
reality. Bixler suggests prediction in the form of an expectancy instead.
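
An expectancy statement of this kind is simple arithmetic on follow-up
records. The sketch below is only an illustration: the function name and the
counts are hypothetical, and it assumes the counselor has a tally of how many
earlier students in the client's score range succeeded in a given field.

```python
def expectancy(successes, total):
    """Phrase a success rate as 'N out of 100', without judging it good or bad."""
    if total <= 0:
        raise ValueError("total must be a positive count")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    per_hundred = round(100 * successes / total)
    return f"About {per_hundred} out of 100 students with scores like yours succeed."

# Hypothetical follow-up tally for one score band in one curriculum.
print(expectancy(120, 200))
```

The statement reports the odds and leaves the evaluation to the client, which
is the point of the suggestion.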

Bixler’s second suggestion is that the counselor should be frank. Low scores
must be faced honestly, if the client is to gain in self-knowledge. A test score
inconsistent with the person’s previous impression of himself forces him to
take a new look at his plans. Students characteristically overestimate their
ability and interest in the vocational field they have chosen. Test results
which challenge these distortions can be beneficial, but they obviously
generate emotional conflict which the tester must turn counselor to dispel.

What is less obvious is that favorable test results are equally likely to pose
problems for the subject. Bordin (1951) tells of the college student who
earned a high score on a “scientific aptitude” test because the test included
achievement items and he had taken considerable science in high school.
Although the student “had made a definite choice of business administration,
he was thrown into a state of indecision by this test result, partly because his
father was a successful engineer. Later counseling proved that his original
choice was well founded and that his indecision would have been short
lived if the tests had been properly interpreted to him by someone who
could also have helped him to relate these results to his percept of himself as
different from his father.”

This does not argue against giving information to the subject. Testing is an
opportunity for him to find out about himself, and it is better to create a
correct self-image than to leave him with false impressions. But the counselor
must decide what information the person is able to assimilate. One
advantage of achievement tests such as the SCAT in college counseling, as
distinguished from tests which appear to measure intelligence, is that the
subject usually finds it easier to accept unfavorable evidence about his
achievement than evidence of “low intelligence.”

The client must always feel free to reject any interpretation. He must be
able to say that, though his score is low, he expects to succeed. He must be
able to reject his own interest test score by insisting that he really likes
engineering despite a low interest in computation. It is only when he learns
that he need not argue with the counselor that he becomes free to examine
himself nondefensively. The counselor should help the client recognize his
emotional reactions to the test scores. Emotional reactions block rational
thinking; the client can use the scores wisely only after he has come to an
understanding of his emotions.

These points are illustrated in the following dialog from a case record
(Bixler and Bixler, 1946):

Counselor. Sixty out of one hundred students with scores like yours succeed in
engineering. About eighty out of one hundred succeed in the social sciences.
The difference is due to the fact that study shows the college aptitude test to
be important in social sciences, along with high school work, instead of
mathematics.

Student. But I want to go into engineering. I think I’d be happier there. Isn’t
that important too?

C. You are disappointed with the way the test came out, but you wonder if your
liking engineering better isn’t pretty important?

S. Yes, but the tests say I would do better in sociology or something like that.
(Disgusted)

C. That disappoints you, because it’s the sort of thing you don’t like.

S. Yes. I took an interest test, didn’t I? What about it?

C. You wonder if it doesn’t agree with the way you feel. The test shows that
most people with your interests enjoy engineering and are not likely to enjoy
social sciences—

S. (Interrupts) But the chances are against me in engineering, aren’t they?

C. It seems pretty hopeless to be interested in engineering under these
conditions, and yet you’re not quite sure.

S. No, that’s right. I wonder if I might not do better in the thing I like—Maybe
my chances are best in engineering anyway. I’ve been told how tough college is,
and I’ve been afraid of it. The tests are encouraging. There isn’t much
difference after all—Being scared makes me overdo the difference.

29. At what age is it appropriate for counselors or school psychologists to give a
child or adolescent information about his abilities?

30. Reread the counselor's remarks carefully. Did he at any time suggest what he
thought was right, or what he approved? Did he disapprove of any idea of
the client?

31. In the dialog quoted, would it be helpful or harmful for the counselor to make
these remarks?

a. It's probably better for you to work in an area you like than to follow these
tests strictly.

b. Most people develop an interest in areas where they do well; you probably
would learn to like social science if you tried it.

c. If you stay in engineering, you should plan to take a course in remedial
mathematics.

d. It seems to you that it's wisest to work in the field where your chances are
best.

32. Which is more likely to be threatening, a report on a general scholastic apti¬ 
tude test or a report on a battery like the DAT? 

Fact-Centered Counseling 

Although emphasis has been placed on nondirective counseling above, it
should not be assumed that prescriptive methods are obsolete. They are
widely used under many circumstances. Some counselors prefer them.
Administrative requirements often force a counselor to take responsibility for
decisions, as when a veterans’ counselor is required by law to approve the
vocational plans of certain trainees. When a case is referred for counseling,
rather than coming in voluntarily, the counselor cannot stick to
client-centered methods. Cases in which the client is incapable of
self-direction must also be prescribed for.

Those using tests prescriptively emphasize the importance of “objective
facts” as a basis for rational decision, in contrast to Rogers’ emphasis on the
emotional meaning of the facts. The prescriptive counselor tends to think of
the client as leaning on someone for direction, and considers tests an
especially sound basis for giving the direction sought. In other cases, the
problem of counseling is to convince the client that his plans should be
changed, and tests are regarded as a forceful type of evidence (Staff,
Advisement and Guidance Service, 1946). The counselor who wishes to bring his
client to face the facts takes a stand similar to John Dewey’s (paraphrased here
from a passage dealing with children, 1938, pp. 84-85):

The suggestion upon which clients act must in any case come from
somewhere. It is impossible to understand why a suggestion from one
who has a larger experience and wider horizon should not be at least as
valid as a suggestion arising from some more or less accidental source.
It is possible of course to abuse the office, and to force the activity
of the young into channels which express the counselor’s purpose rather
than that of the client. But the way to avoid this is not for the counselor
to withdraw entirely. . . . The counselor’s suggestion is not a mold for
a cast-iron result but is a starting point to be developed into a plan
through contributions from the experience of all engaged in the
counseling process.

Prescriptive counselors generally obtain a variety of information, make an
interpretation, and bring the client to act on this information. While they
respect the right of the client to choose between alternatives of merit and do
not force even a wise course of action upon him, their emphasis is on keeping
the client from making errors. Williamson (1939, pp. 134-138) puts the
position this way:

The effective counselor is one who induces the student to want to
utilize his assets in ways which will yield success and satisfaction.
Ordinarily the counselor states his point of view with definiteness,
attempting through exposition to enlighten the student. In respect
to no student’s problem does the counselor appear indecisive to the
extent of permitting loss of confidence in the authority of his information.
. . . If it is true that the counselor should not make the student’s
decision, it is equally true that someone must render this very service until
some students are able, intellectually and emotionally, to think for
themselves.

In helping the client make decisions, the counselor, whatever his
technique, wishes the client to have a basis for optimism. The nondirective
counselor would prefer that this come through insight, whereas the directive
counselor tends to give direct encouragement. In either case, however, the
client should leave the counseling with a positive plan for action, rather
than merely with the knowledge that his former plan was inadequate.
Similarly, he must have a feeling that he has some strong qualities, rather than
a total feeling of failure because tests have brought to light only weaknesses.
In every test performance, there are some praiseworthy aspects. The
counselor who wishes to give support will call attention to such features as
accuracy, originality, or persistence, in addition to giving the client facts
about his score. Nearly all counselors working with normal late adolescents and
adults agree in giving the client the facts on which recommendations (if
any) are based. The counselor who refuses to give scores even in general
form sets up a fear in the client that he was not told because his scores were
too poor.

The most helpful single principle in all testing is that test scores are
merely data on which to base further study. They must be coordinated with
background facts, and they must be verified by constant comparison with
other available data. This is the reason that continued counseling by an
adviser over a year is more effective than “one shot” counseling where an
answer is given to each new specific problem by a different adviser. The test
score helps the counselor by warning him to look in the record for further
symptoms of a particular problem. The score, and study of items within the
tests, suggest topics to probe by interview methods. While sometimes it is
necessary to act on a problem immediately, it is sound practice to defer a
final decision as long as possible, meanwhile seeking confirmation of
tentative diagnoses.

33. Discuss the advisability of delaying final decision in each of these situations.
What supplementary information should be sought to confirm the tentative
conclusions?

a. A college student who is failing in engineering at midterm seeks a more
suitable vocational goal. Aptitude and interest tests suggest journalism.

b. An engaged couple, after a quarrel, seeks the help of a marital counselor.
A personality test intended to predict marital adjustment (validity .50)
shows that their score as a pair is low, in the range where there is an even
chance of divorce.

c. Students applying to enter a graduate school for social work are tested
routinely. A girl shows severe neurotic signs on both a questionnaire and a
subtle, moderately dependable personality test.


Suggested Readings 

Bennett, George K., & others. Counseling from profiles. New York: Psychological
Corporation, 1951.

This booklet presents a general discussion of the DAT and a philosophy of
counseling, then discusses thirty cases showing a variety of realistic problems
where aptitude profiles are useful.

Bordin, Edward S. Test selection and interpretation, and Illustrations and
problems. Psychological counseling. New York: Appleton-Century-Crofts, 1955.
Pp. 262-331.

Bordin amplifies his view that tests imposed on the client without adequate
preparation may delay improvement, and shows by extracts from interviews
how skilled counselors deal with such problems as the client who expects tests
to make decisions for him, and the client who has been forced into counseling.

Lamke, Tom A., & Nelson, M. J. Single-score tests vs. factor-score tests.
Examiner’s manual, the Henmon-Nelson Tests of Mental Ability. Boston: Houghton
Mifflin, 1957. Pp. 19-22.

The Henmon-Nelson test series yields a single measure of general ability.
When it was revised, this section was added to the manual to explain why the
authors had not shifted to the multiscore pattern. The authors’ view that
differential testing has little or no advantage over single-score testing
should be compared to the views expressed in the Super reference, below.

Super, Donald E. (ed.). The use of multifactor tests in guidance. Washington

Other Special Abilities


THE tests discussed in the preceding chapter are the ones most often used
in guidance. The present chapter describes other tests of special abilities,
including those for psychomotor and artistic aptitudes.

PSYCHOMOTOR ABILITIES 

The only psychomotor performances considered to this point are the simple
speed and dexterity measures of the GATB. Many tests using more elaborate
apparatus and measuring more complex abilities have been tried, and
many have shown predictive value. Since the tests are costly to construct,
maintain, and administer, their use is largely confined to industrial and
military classification.

The costliness of psychomotor testing, combined with the difficulties of
obtaining adequate criteria of occupational success, has discouraged research
on motor abilities. Our knowledge rests almost entirely on a few research
programs, of which by far the most significant has been that of the Air Force,
which has large samples of men, excellent equipment and control of testing
conditions, and superior criterion data (Melton, 1947; Fleishman, 1956).

All psychomotor tasks involve intellectual abilities such as are found in
pencil-paper tests. Many apparatus tests are correlated with factors P, S, and
Mechanical Experience, as well as with strictly psychomotor factors. We
shall concentrate here on the uniquely motor abilities and the tests which
measure them. We shall describe a number of illustrative tests before turning
to a factor-analytic classification of motor abilities.

Simple Performance Measures 

Reaction Time. Measurement of reaction time goes back to the earliest days
of experimental psychology. The techniques used today differ only in
elegance of instrumentation from some of the procedures Wundt and Cattell
introduced in the first psychological laboratory at Leipzig. The subject is
told to react to a light or other signal as quickly as he can. When he presses
the response button, an electrical timer records the interval that elapsed
between signal and response.

Modern apparatus can present a whole series of stimuli, record times, and
cumulate the score, all automatically. The signal apparatus is "programmed"
by a tape or a cam so as to present signals at irregular intervals. Such
automation is important for tests involving complicated stimulus patterns, as
in measures of discriminative reaction time, because it speeds up testing and
reduces the variation in testing procedure.

Although it has often been thought that simple reaction time is relevant to
automobile driving and to many jobs, consistent evidence to support this
view is lacking. Simple reaction is a different matter entirely from reaction
with judgment. A test of discriminative reaction time, where a different
button must be pushed for each pattern of light signals, correlates only about
.30 with simple reaction time (Melton, 1947, p. 102). Most practical
performances probably depend more on choice reaction than on simple reaction.

Steadiness and Simple Controlled Movement. Steadiness is required where
one must maintain a fixed posture or must trace a pattern accurately.
Postural steadiness can be tested by recording movements of a platform on
which the subject stands. Arm steadiness is tested by requiring the subject to
hold a stylus outstretched in a small aperture without touching its sides. The
stylus and base plate are connected electrically, and each contact is
registered on a counter.

FIG. 54. An Air Force steadiness test (Fleishman, 1954).

So-called "aiming" tests involve quick, precise eye-hand coordinations.
Aiming may also be measured by a stylus-and-hole apparatus. The subject
is required to thrust the stylus into successively smaller holes without
touching the sides, or into holes momentarily uncovered by a rotating shutter.
A pencil-paper version of this test requires the subject to place dots in
small circles as fast as he can; this test involves motor speed as well as
precision of movement.

Tests of aiming and steadiness had negligible validity for selection of pilots
and bombardiers. Arm-hand steadiness is related to success of aircraft
electricians, according to one study. Several studies have found very high
correlations between aiming or steadiness tests and rifle marksmanship
(Humphreys et al., 1936).

1. Decide which type of steadiness test would be most promising for selecting
persons for each of the tasks listed below. If none of the tests mentioned
above seems fully suitable, attempt to describe one more comparable to the job.

a. A jigsaw operator is to move a board, about eight inches square, so that a
curved pattern is cut out.

b. A rifleman must hold his sights steadily on a target while resting on an
elbow in a prone position.

c. A pistol marksman must hold his sights steadily on a target while standing.

d. An engraver must follow a pattern with great precision, using a small power
tool.

Speed and Dexterity. We have already encountered speed of movement in
the USES tests, where it enters into scores K, F, and M. The nearest to a pure
measure of movement is the Mark Making test (p. 274). In Table 39 we
noted that the manual factor M, involving speed and dexterity, correlated
.30 to .55 with success in many jobs, having a notably high correlation with
success of persons mounting wires in radio tubes. Factor K has equally large
correlations with such jobs as typing, telephone operating, packing, and
outboard motor assembling. In general, motor speed is important in
overlearned routine tasks.

The manual and finger dexterity tests of the USES require simple rapid
movements. Some other tests require more complicated movements, for
example, inserting pins into narrow holes with tweezers, or threading nuts
onto bolts. Low-to-moderate positive correlations are reported for dexterity
tests as predictors of office and factory jobs.

Complex Coordinations 

Instead of the fairly simple tasks described above, one can ask the subject
to do quite complicated acts. There is little rationale to guide in the
design of these complex tasks, and a good deal of apparatus testing has been
based merely on hunches.

One principle that has often worked well is that of the job replica. If we
are selecting workers to perform a particular job, we might observe a
worksample, i.e., we might observe them briefly on the job itself and record
their output. If the job requires training after selection or uses expensive
apparatus, however, the true worksample may be impractical. In such a case,
the tester tries to design an apparatus which reproduces much of the original
task, without requiring skills that have to be developed during job training.

An excellent example of the job replica is the Complex Coordination
test of the Air Force (Figure 2). One cannot observe a would-be pilot in an
airplane, but the Complex Coordination test gives him a stick and rudder
bar which he is to move much as the pilot does. Movements are dictated by
signal lights. When a light appears at the top of the left center column, the
man pulls the stick so that the right center light will move upward to match
it. A sideways movement of the stick controls the light in the top row, and
the rudder controls the light running across the bottom row.

This test had a validity of about .40 for predicting pilot success and was
given the highest weight among all tests used in the selection battery. A
factor analysis demonstrated the reason for this high validity: the Complex
Coordination test duplicates better than any other test the common-factor
composition of the student pilot's task. Table 41 (cf. Figure 44) gives the
loadings of the test and of the criterion (graduation from pilot training vs.
elimination). The products of the loadings, in the farthest right column, show
how much each factor adds to the total validity. The total of these products
agrees almost exactly with the observed validity of .39. The predictive
validity is accounted for entirely by the common factors, which means that the
specific content of the Complex Coordination test does not contribute to
prediction of pilot success. Note that the spatial aspect of the test accounts
for more of its predictive value than does the coordination factor.


TABLE 41. Factor Loadings of the Complex Coordination Test and the
Pilot Success Criterion

                                    Loadings of
                            Complex        Graduation-
                            Coordination   Elimination    Product of
Factor                      Test           Criterion      Loadings

Spatial                         .49            .34           .167
Psychomotor coordination        .34            .22           .075
Mechanical experience           .20            .26           .052
Interest in piloting            .17            .28           .048
Visualization                   .17            .25           .042
Perceptual speed                .17            .15           .026
Numerical                       .09            .01           .001
Verbal                         -.01           -.02           .000
Reasoning                       .02           -.02          -.000
Uninterpreted factor            .10           -.03          -.003

Total                                                        .408

SOURCE: Melton, 1947, p. 995.
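
The arithmetic behind Table 41 can be checked directly. The sketch below uses Python (a choice made here for illustration; nothing of the kind appears in the Air Force reports) to multiply each pair of loadings and sum the products. It assumes uncorrelated common factors, in which case the correlation between test and criterion equals the sum of the products of their loadings. The names LOADINGS, validity_from_loadings, and contributions are hypothetical.

```python
# Loadings from Table 41:
# (Complex Coordination test, graduation-elimination criterion).
LOADINGS = {
    "Spatial": (0.49, 0.34),
    "Psychomotor coordination": (0.34, 0.22),
    "Mechanical experience": (0.20, 0.26),
    "Interest in piloting": (0.17, 0.28),
    "Visualization": (0.17, 0.25),
    "Perceptual speed": (0.17, 0.15),
    "Numerical": (0.09, 0.01),
    "Verbal": (-0.01, -0.02),
    "Reasoning": (0.02, -0.02),
    "Uninterpreted factor": (0.10, -0.03),
}


def validity_from_loadings(loadings):
    """Sum of (test loading x criterion loading); orthogonal factors assumed."""
    return sum(test * criterion for test, criterion in loadings.values())


def contributions(loadings):
    """Per-factor products of loadings, largest contribution first."""
    products = {name: t * c for name, (t, c) in loadings.items()}
    return sorted(products.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, product in contributions(LOADINGS):
        print(f"{name:26s} {product:6.3f}")
    print(f"{'Total':26s} {validity_from_loadings(LOADINGS):6.3f}")
```

Summing the unrounded products gives a total within rounding error of the .408 printed in the table, and the Spatial factor heads the list of contributions, as the text notes.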

Another type of complex task calls for "pursuit" or "tracking," i.e.,
following an irregular course or a moving target as in gunnery, radar
operation, and high-speed maneuvering. The essence of a pursuit test is a
moving target which must be followed with a pointer of some kind. Four pursuit
devices are shown in Figure 55. The Rotary Pursuit Test is the simplest and
the oldest. It has been used as a predictive device and as a laboratory
instrument for the study of skill learning. A 31-inch brass disk is set in a
bakelite turntable. The subject uses a stylus with a hinged handle to follow
the disk, his total contact time being recorded electrically. Many variations
are possible. In the Pursuit Confusion Test, the speed of the target changes,
and the subject has to guide his tracking by watching in a mirror rather than
by viewing the target directly. The Two-Hand Coordination Test involves
slower but more complex movement. One handle controls left-right motion
of the follower arm, while the other controls front-to-back motion. Both must
be moved at the same time, at different speeds, to stay on the target. The
Rudder Control Test is another job replica, which has the honor of being the
only psychomotor test invented during World War II which proved valuable
enough to put into immediate use for pilot selection. The man sits in a
cockpit and is to keep the cockpit pointed directly toward whichever of
three target lights is lit. The direction of the cockpit is controlled both by
the rudder and by the man's posture, so that this becomes a test of bodily
balance as well as leg coordination.

FIG. 55. Four pursuit or coordination tasks (Fleishman, 1956). Except for
Rudder Control, which is large enough for the subject to sit in, the tests are
of desk-top size.

Complex job replicas are of considerable value, provided they truly
resemble the job to be predicted. An example is the so-called Metal Filing
Worksample, intended to measure that skill as used in dentistry. This isolates
one element of the job and measures it directly. It correlated .53 with
grades in dentistry courses (Bellows, 1940). The I.E.R. Trimming Test, in
which the subject cuts between a pair of narrowing lines with scissors,
correlated .69 with ratings of power-sewing-machine trainees (Treat, 1929).
The Hand-Tool Dexterity Test, requiring operations on nuts and bolts with
wrench and screwdriver, correlated .46 with performance of machinists
(Bennett and Fear, 1943).


TABLE 42. Prediction of Quality and Quantity of Work of Sewing-Machine
Operators

                                       Correlation with    Correlation with
                                       Quality Criterion   Speed Criterion
Test                                   (N = 52)            (N = 52)

Minnesota Clerical, Names                   .36                 .08
Minnesota Clerical, Number                  .26                 .22
Poppelreuter Tracing (time score)          -.31                 .45
Poppelreuter Weaving                        .27                 .21
Paper folding                               .30                -.10
Minnesota Spatial Relations (time)          .24                 .28
Minnesota Paper Form Board                  .32                 .17
O'Connor Tweezer Dexterity                  .07                 .46
O'Connor Finger Dexterity                   .20                 .27
Minnesota Rate of Manipulation              .08                 .31
Otis Self-Administering (IQ)                .17                 .11

Tests with correlations in boldface,
  combined                                  .57                 .64

SOURCE: J. L. Otis, 1938.


In line with the suggestion that good predictors resemble the job, J. L. Otis
(1938) found that tests which predict quality on a job may be poor
predictors of speed. Correlations of predictive tests for sewing-machine
operators with both speed and quality criteria are shown in Table 42. Otis
points out that workers suitable for a shop stressing quality may lack
aptitudes needed in a shop seeking high volume of production. The user of
psychomotor tests must have clearly in mind the nature of the job he wishes to
predict.
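
Otis's point can be illustrated from Table 42 itself. The Python sketch below lists the three tests with the highest coefficient for each criterion. It takes the coefficients as printed (reading each entry as a two-place decimal) and ranks by raw value, a simplification adopted here that ignores the sign convention of the time scores. The names TABLE_42 and top_predictors are hypothetical.

```python
# (correlation with quality, correlation with speed) for each test in
# Table 42 (J. L. Otis, 1938), N = 52.
TABLE_42 = {
    "Minnesota Clerical, Names": (0.36, 0.08),
    "Minnesota Clerical, Number": (0.26, 0.22),
    "Poppelreuter Tracing (time score)": (-0.31, 0.45),
    "Poppelreuter Weaving": (0.27, 0.21),
    "Paper folding": (0.30, -0.10),
    "Minnesota Spatial Relations (time)": (0.24, 0.28),
    "Minnesota Paper Form Board": (0.32, 0.17),
    "O'Connor Tweezer Dexterity": (0.07, 0.46),
    "O'Connor Finger Dexterity": (0.20, 0.27),
    "Minnesota Rate of Manipulation": (0.08, 0.31),
    "Otis Self-Administering (IQ)": (0.17, 0.11),
}

QUALITY, SPEED = 0, 1  # column positions within each tuple


def top_predictors(table, criterion, n=3):
    """Names of the n tests with the highest coefficient for one criterion."""
    ranked = sorted(table, key=lambda test: table[test][criterion], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    best_quality = top_predictors(TABLE_42, QUALITY)
    best_speed = top_predictors(TABLE_42, SPEED)
    print("Quality:", best_quality)
    print("Speed:  ", best_speed)
    print("In both lists:", sorted(set(best_quality) & set(best_speed)))
```

On these figures no test appears in both lists, which is the practical meaning of choosing tests for the particular criterion the shop cares about.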

The difficulties of interpreting psychomotor tests which arise from their
specificity and from the shortage of relevant psychological theory should not
lead the personnel psychologist to underrate them. When the Air Force
made a serious effort to use them on a large scale, they turned out to be
not only practical but nearly indispensable. Their contribution to pilot
selection was about equal to that of the much-better-understood printed tests.

2. The factor loadings for Complex Coordination, squared, give the percentage
composition of the test variance. How do the values from Table 41 compare with
those in Figure 44, p. 254? (The two results are based on different groups of
student pilots, and a larger number of tests was used in arriving at the
results in Figure 44.)


Factors in Psychomotor Performance 

The extensive apparatus testing program of the Air Force continued from
1942 to 1955. During the program, Air Force psychologists collected data of
an unprecedented type, giving batteries of reliable apparatus tests to large
samples of men. The results promise to bring some order into the chaotic
theory of psychomotor testing.

Hitherto, it was necessary to describe each test in turn. No basic list of
abilities had been isolated, and factorial results had been incomplete and
partly contradictory. Fleishman, on the basis of his recent work, now offers a
list of factors which may account for much of the psychomotor domain. (This
summary comes from Fleishman, 1956. Some of the original factor analyses
are reported in Fleishman and Hempel, 1954b, 1956; Hempel and Fleishman,
1953; and Fleishman, 1954. See also Fleishman, 1953.) The list must
be regarded as tentative, however, until it is cross-checked by work outside
the Air Force. It seems fair to say that Fleishman has brought psychomotor
testing to about the point that intellectual testing reached in 1940, following
Thurstone's first report on the "primary abilities." Views in that field have
changed extensively since 1940, and time will no doubt alter Fleishman's
list.

Some of Fleishman's factors are old acquaintances like finger dexterity.
Others represent distinctions never previously suggested. The list of major
factors is as follows.

Reaction time. Quickness of a simple, overlearned movement in response
to a signal.

Arm-hand steadiness. Precision and steadiness in positioning movements.
Speed and strength irrelevant.

Rate of arm movement. Speed of gross arm movements. Precision
irrelevant. The only test we have described which seems to measure this factor
is the Place test of GATB.

Finger dexterity. Skillful, controlled finger movement. The GATB
Assembly and Turn tests have loadings on this factor.

Manual dexterity. Skillful, controlled movements in manipulating larger
objects with whole hand. The Turn test is one of the better measures of
this factor.

Postural discrimination. Making precise bodily adjustments on the basis
of postural cues. Walking a rail blindfolded would probably be a good
test. The experimental measure of this factor is a test where the subject
is seated blindfold in a tilted chair and must push buttons to bring the
chair upright.

Fine psychomotor coordination. Also called "fine control sensitivity."
Delicate, highly controlled adjustments involving large-muscle groups, as
in Rotary Pursuit and Pursuit Confusion.

Multiple limb coordination. Using two arms, arms and legs, etc., in a
simultaneous control movement such as clutching-and-shifting an
automobile transmission. This is measured in Complex Coordination and
Rudder Control.

Rate control. Involves continuous anticipations and adjustments of
timing in tracking a target with variable speed and path.

Response orientation. Choosing the proper response among several
alternatives. This has been tested by complex discrimination tasks where
each signal pattern calls for movement in a different direction. Can be
measured by pencil-paper tests as well as by apparatus.

Response integration. Combination of information into a single
integrated motor response. Two-hand Coordination and Complex
Coordination involve this ability.

In addition to this main list, a number of sheer physical factors such as
"strength" are found, and a number of lesser psychomotor factors which are
not yet well established.

Psychomotor tests which involve different factors often have very low
correlations. (For example, Rate of Manipulation (Placing) correlates only
.02 with Rudder Control.) This definitely rules out the idea of a general
psychomotor ability which makes some people good at any manual or
athletic task.

3. Which factor involves complex movement of small-muscle groups, with little
emphasis on speed?

4. Which factors do you think are involved in each of these tasks?

a. Riding a bicycle.

b. Typewriting.

c. Cutting dress materials, following a pattern.

5. These items are included in the MacQuarrie Test of Mechanical Ability. All
are given with short time limits. Which factors does each seem to measure?

a. Dotting; % G-inch circles, irregularly spaced. Place one dot in each circle.

b. Tracing; a series of 1-inch vertical lines, each with a j^g-inch opening
somewhere along its length. Trace a path through the openings.

c. Tapping; %-inch circles regularly spaced. Put three dots in each circle.

6. The Purdue Pegboard requires the subject to place small pegs into holes,
first with his right hand, then with his left hand. Loadings for the right-hand
and left-hand scores, respectively, on various factors were as follows:
reaction time, .25, .02; arm-hand steadiness, .14, .06; rate of arm movement,
.22, .13; finger dexterity, .46, .58. What explanation can you give for the
differences observed between right- and left-hand scores?

7. The correlation of visual and auditory reaction time is only .56. How can
the gap between this value and the reliability of .85 best be explained?


GENERAL PROBLEMS OF PSYCHOMOTOR TESTING

Apparatus Differences

Test apparatus is supposed to be standardized, especially when results at
one time and place set standards to be used in future selection or guidance.
The Air Force found that even when several pieces of apparatus were made
in the same shop from the same blueprints they were rarely equivalent.
Moreover, each apparatus changed over time as electrical contacts became
dirty, rubber parts became less elastic, and so on. For example, in the
relatively simple Arm-Hand Steadiness Test the mean scores earned on four
different pieces of apparatus were 227, 230, 260, and 291 (Melton, 1947).
These differences are of practical importance, since the standard deviation
of scores is about 120 points. A score which was average on one machine
would be near the 30th percentile on another.
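
The percentile figure can be reproduced with a short calculation. The Python sketch below assumes that scores on each machine are roughly normally distributed with the standard deviation of 120 quoted above. The normality assumption and the function name percentile_on_other_machine are illustrative and are not taken from Melton.

```python
from statistics import NormalDist

MACHINE_MEANS = [227, 230, 260, 291]  # mean scores on the four machines
SCORE_SD = 120                        # approximate standard deviation of scores


def percentile_on_other_machine(score, other_mean, sd=SCORE_SD):
    """Percentile rank of `score` among men tested on a machine with mean `other_mean`.

    Assumes normally distributed scores, which the source does not state.
    """
    return 100 * NormalDist(mu=other_mean, sigma=sd).cdf(score)


if __name__ == "__main__":
    lowest, highest = min(MACHINE_MEANS), max(MACHINE_MEANS)
    # A man scoring at the mean of the lowest-scoring machine,
    # judged against norms from the highest-scoring machine.
    rank = percentile_on_other_machine(lowest, highest)
    print(f"Percentile on the other machine: {rank:.1f}")
```

The gap of 64 points between the extreme machines is a little over half a standard deviation, which is what moves an average score to roughly the 30th percentile.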

Pencil-Paper Measures of Motor Performance 

Apparatus tests are virtually out of the question in guidance testing, and
it has rarely been practical to use them in industrial and military selection.
Initial cost is not the critical factor; Melton estimates that $250,000 covered
the total cost of apparatus for processing tens of thousands of Air Force men.
The big cost is in time of the persons who must give the tests, for even with
highly efficient arrangements a tester can handle only four to six subjects at
once. So long as apparatus tests predict validly, they can save enough to
more than repay their costs. In the Air Force, every man who failed pilot
training represented a waste of $25,000. This easily justified using expensive
tests to detect failure in advance. It is nevertheless obvious that testers
would gladly substitute pencil-paper tests if these would measure the same
aptitudes.
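
A rough break-even figure follows from the two dollar amounts quoted above. The Python sketch below covers only the apparatus cost; it leaves out the examiners' time, which the text identifies as the larger expense and for which no figure is given. The function name failures_to_break_even is hypothetical.

```python
import math

APPARATUS_COST = 250_000   # Melton's estimate for all apparatus, in dollars
COST_PER_FAILURE = 25_000  # training cost wasted on each man eliminated


def failures_to_break_even(fixed_cost, cost_per_failure):
    """Smallest number of prevented training failures that recovers fixed_cost."""
    return math.ceil(fixed_cost / cost_per_failure)


if __name__ == "__main__":
    needed = failures_to_break_even(APPARATUS_COST, COST_PER_FAILURE)
    print(f"Failures to prevent before the apparatus pays for itself: {needed}")
```

Set against the tens of thousands of men processed, a handful of prevented failures is a very small requirement, which is why initial cost is not the critical factor.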

We have seen in the GATB Mark Making test an example of a pencil-paper
psychomotor test. Some psychologists are convinced that with sufficient
ingenuity other important motor abilities can be reduced to group
pencil-paper tests. The evidence on the question is extremely fragmentary.
One Air Force study (Melton, 1947, pp. 1033 ff.) found that when apparatus
tests and pencil-paper tests of motor speed were put in the same battery,
the two groups defined quite separate abilities. Such small efforts as were
made during the war to obtain validities for pencil-paper psychomotor tests
were discouraging. More recently, Fleishman (1954) introduced several
printed psychomotor tests into a battery and found that they were quite
successful in measuring wrist-finger speed and fairly successful in measuring
aiming and steadiness. The pencil-paper tests had little in common with the
more complex coordination and dexterity tests.

Reliability 

As in the case of performance tests of general ability, reliability has been
a source of difficulty in psychomotor tests. It will be recalled that in the
GATB, F and M are the least reliable scores. Reliabilities for apparatus tests
as usually given are in the neighborhood of .70. This level may be satisfactory
when the test is to be combined with several others in an overall prediction,
but it makes the test untrustworthy by itself.

One might reasonably suppose that extending the test period would raise
reliability, and if so, only cost of testing prevents us from boosting
reliability just as we would for a pencil-paper test. The reliability of
apparatus tests does not increase with length in the normal manner, however,
because two successive sections are not "equivalent." This is best shown with
the Rotary Pursuit Test, where the internal consistency of a ten-trial score is
.97, but the correlation of the first ten with the second ten trials is only
.84. For all the Air Force tests the same thing was found: reliability
increases with length, but more slowly than the Spearman-Brown formula
predicts.
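
The Spearman-Brown formula predicts the reliability of a test lengthened k-fold from the reliability r of the original: r_k = kr / (1 + (k - 1)r). The Python sketch below applies it to the Rotary Pursuit figures. Treating the correlation between the first and second ten trials as a coefficient to be stepped up in the same way is an assumption made here for illustration, and the function name spearman_brown is hypothetical.

```python
def spearman_brown(r, k):
    """Predicted reliability of a test lengthened k-fold, given reliability r."""
    return k * r / (1 + (k - 1) * r)


if __name__ == "__main__":
    # If successive ten-trial blocks were equivalent, the internal consistency
    # of .97 would predict this reliability for a twenty-trial score.
    predicted = spearman_brown(0.97, 2)
    # The observed correlation of .84 between the two blocks, stepped up the
    # same way, implies a lower twenty-trial reliability.
    implied = spearman_brown(0.84, 2)
    print(f"Predicted from internal consistency:   {predicted:.3f}")
    print(f"Implied by block-to-block correlation: {implied:.3f}")
```

The second value falls short of the first, which is the sense in which reliability grows with length more slowly than the formula predicts.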

The reason is that the test measures different things at different stages of
practice. The first and second ten trials are outwardly similar, but
psychologically they pose different tasks for the subject. To that topic we
now turn.

Changes in Meaning with Stage of Practice 

In an aptitude test it is important to obtain a stable measure, characteristic
of the person over a period of time. Scores are unstable if we apply a
psychomotor test without giving the subject a chance to learn the task in
preliminary trials. This is again an example of the principle that abilities,
while emerging, cannot be accurately measured.

On complicated testing devices, a subject cannot show his full ability
until he has become familiar with the reaction required. Fleishman and
Hempel (1954a; Fleishman, 1957) gave 64 two-minute trials on the Complex
Coordination Test. (Eight minutes is the usual testing time.) The scores,
together with reference tests, were factored. Figure 56 shows that the factor
content of the test depends on the amount of practice. That is, different men
score high at different stages of practice. In the early stages, cognitive
factors such as S and Vz are most important, along with Psychomotor
Coordination. These interpretative factors account for little variance after
subjects become familiar with the task. Psychomotor Coordination, a factor
common to this test and to other motor tests administered, increases in
importance during the first 40 minutes of practice but then drops back. Two
factors grow steadily in prominence: rate of movement and a specific factor
which we shall discuss further in a moment. Evidently the early trials
measure the person's adaptation to a new task, and intellectual factors play a
large part in the variance. At the end, sheer speed has become one of the
leading sources of individual differences.

FIG. 56. Composition of Complex Coordination Test as a function of practice.
(Data from Fleishman and Hempel, 1954a.) Curves show proportion of variance
accounted for by each factor after removing error variance and unidentified
minor factors from consideration. Curves have been smoothed, and factors have
been combined as follows: Cognitive includes Spatial, Visualization, and
Mechanical Experience; Motor Speed includes Rate of Movement and Psychomotor
Speed.

The substantial specific factor found in the Complex Coordination scores
and not in other apparatus tests, such as Rotary Pursuit, becomes the largest
source of individual differences after the first hour of practice. This factor
involves some difference among individuals generated during the test and
presumably corresponds to particular coordinations or work methods which
the individual develops as he practices (Stevens, 1951, pp. 1341-1362). If a
man with good general aptitude happens to fall into a bad habit, his final
score may be far below his potential. Specific bad habits are highly persistent
from trial to trial, as any athlete knows. They are not just a matter of low
aptitude; the professional coach makes a career of recognizing and eliminating
such faults from the performance of talented athletes. Not all specific
responses are harmful; sometimes one stumbles into a fortunate pattern or
rhythm which gives him a higher score than his aptitude would have
predicted. The specific factor built up only in the Complex Coordination Test
is unlikely to be of much predictive value. There is no reason to think that a
man who acquires a certain bad sequence of actions on this test will form a
similar habit when learning to fly a plane. In learning that task, other
specific habits will develop.

Stability 

When enough research has been done to provide a basis for practical
interpretations of the factors, measures of these factors may be able to play a
major role in vocational guidance. Before tests can be used to make long-range
predictions, however, research on the stability of psychomotor aptitudes will
be necessary. There is considerable evidence of stability of scores over
periods of a few months, but little is known about stability over several
years. For guidance of adolescents, it is necessary to know at what age
psychomotor abilities begin to stabilize. The only substantial follow-up data
so far available come from a study made in a Texas high school, where the same
114 pupils were tested with the GATB each year (Table 36). These data, though
limited, suggest strongly that simple psychomotor abilities stabilize at about
the same age as intellectual abilities.

Validity 

Any verbal or intellectual test is likely to predict many criterion tasks. A
numerical reasoning test may predict English, mathematics, science, and
shop grades, all equally well, simply because all the measures depend on
general ability. Despite popular legends about boys who are "good with their
hands," there is no such general psychomotor ability. Any psychomotor test
is much more valid for some jobs than for others. Psychomotor tests must be
chosen for each particular job. They can be trusted as predictors only after
an empirical tryout. There have been many disappointing studies in which
tests which "ought to have" predicted occupational criteria failed to do so
(Stevens, 1951, pp. 1341-1362).

In trying out tests with the hope of finding a good predictor, past
experience suggests the job replica as the best bet. The common-sense rule
that the test which resembles the job ought to predict the job has generally
paid off with encouraging validity coefficients. If intelligently designed,
the job replica can demand the same coordinations, speed, and precision as are
called for in the criterion task. Furthermore, it looks like a reasonable test
and therefore appeals to the subject and to the employer in whose behalf the
tests are used. There is little doubt that job replicas will continue to be
used in industrial and military selection for some time.

Some of the results for the Complex Coordination Test raise a serious
question as to whether the job replica deserves its good reputation. Much of
the validity of that test in pilot selection was due to intellectual abilities
which could be measured in pencil-paper tests. The psychomotor aspect of the
test which predicted the criterion was the coordination factor common to
several apparatus tests which were not replicas of the pilot's job. Rotary
Pursuit, which does not "look like" anything the pilot does, was a better
measure of the coordination factor than Complex Coordination. If there is
anything special about the Complex Coordination Test as a replica of the
pilot's job, it must be in the specific factor, and that contributed nothing
to validity. So the magic of the Complex Coordination Test appears to be that
it involves about the right mixture of common factors. If so, such a mixture
could be put together by adding scores from other tests measuring the same
factors.

One of the reasons for wishing to avoid the job-replica principle in test
design is that it leads to an endless process of inventing and revising tests
to cover additional jobs or to take changes in job requirements into account.
Vocational guidance could not possibly be based on such tests, since
hundreds of tests would have to be given to cover the occupational spectrum.
Fleishman believes that it will be possible to prepare a short battery to
measure the chief psychomotor factors, and to combine scores from this
battery so as to predict the psychomotor component of any job. At present, no
one can say how well the factor scores will predict jobs, and one cannot
guarantee that the list of psychomotor abilities will remain short. The
factors do account for about half of the variance in current tests; therefore
combinations of the factor scores may be able to do much of what the original
tests can do.

While psychomotor abilities are often of value in predicting occupational
performance, they are generally of less importance than intellectual abilities.
In the USES studies, there were relatively few occupations where motor
factors were substantially better predictors than the nonmotor factors. These
exceptional occupations fall into two broad categories: bench work or
assembling (cheese wrapper, telephone diaphragm assembler, paper-pattern
folder) and manipulative machine operating (machine clothes presser, bag
sealer). Motor tests have excellent validity for these routine jobs, but as
soon as a job becomes less repetitive, perceptual and intellectual factors
make large contributions. Ghiselli's study (1955) of other published
coefficients supports this conclusion, but adds that motor tests make a
substantial contribution in predicting success of structural workers. Tiffin
(1952, p. 126) comments as follows on the failure of motor tests to predict
skilled work (see also Patterson, 1956).

A consideration of the skills demanded of the industrial tradesman or
skilled machine operator indicates that this employee usually succeeds
or fails in proportion to his training and general mechanical
comprehension, not in proportion to his basic dexterity. This fact does not
mean that successful tradesmen do not need skilled movements, but it does
mean that such muscular coordination as may be needed can be developed
by the majority of tradesmen in training and that it is lack of
mechanical comprehension rather than inability to develop the muscular
aspects of the job, that may prevent them from becoming really
proficient in this line of work. This implies that only in the most repetitive
performance can psychomotor tests alone provide an adequate basis for
selection. In complex jobs where psychomotor tests do not predict
ultimate proficiency, they nonetheless may make a valuable contribution by
identifying which persons can master the motor components most
rapidly.

8. What function might psychomotor tests play in making school physical
education programs more profitable?

9. Experimenters wish to study the effect of vitamin lack on motor performance.
They plan to test a group, then alter the diet, and test again after some time.
Would it be desirable to offer training on the tests before the first
measurement?

10. Under what circumstances would extending a psychomotor test so that it
measures fatigue or endurance lower its validity?

11. Bennett and Cruikshank (1942) say, "Vocational guidance (not selection) on
the basis of motor skill alone is quite deplorable, except in the case of
individuals who have gross incapacitating motor disabilities." Do you agree?


ARTISTIC ABILITIES 

General mental tests measure ability to succeed in courses, but a high score
does not guarantee creativeness even in strictly intellectual work. In
painting, architecture, and other graphic arts, special talents must certainly play a
large part in success. To identify such talents has proved exceedingly
difficult, but several tests of artistic ability have been tried. (In addition to these
approaches, some effort has been made to predict from personality tests.)


Worksamples 

One way of identifying those who will do well in art training is to obtain a
sample of the person's creative drawing. This may measure training just as
much as talent, but it is a fair basis for comparing persons with similar
training. Merely asking the subject to draw or paint a picture, or to submit a piece
of completed work, however, does not make for a standardized comparison.
To standardize the task by requiring everyone to draw from the same model,
on the other hand, leaves no room for creativeness.



FIG. 57. Specimen item from Horn Art Aptitude Inventory, showing stimulus lines and two
drawings based on them (C. A. Horn and L. F. Smith, 1945).


The Horn Art Aptitude test attempts to solve this problem by a job replica
calling for high-level creativeness under very slight constraints. In the
“Imagery” section of the test the subject is given several cards, each bearing a
pattern of lines. Around these lines he is to sketch a picture. The pictures are
judged by art instructors as to imagination and technical drawing quality.
Using careful scoring directions, competent judges can attain a correlation of
.86 between independent scorings. The other chief section of the test calls for
arrangement of rectangles and other simple figures into balanced
compositions.

The test is intended for use with applicants to art school, most of whom
have had previous training. The scores at the beginning of the year
correlated .66 with grades in a special art course for high-school seniors (C. A.
Horn and L. F. Smith, 1945).

Gilbert and Ewing also favor the job-replica principle. Their unpublished
Illinois Art Ability Test (based on one type of item in an earlier test by
Knauber) asks the subject to draw certain objects (e.g., a table) in perspective.
The drawing is scored not only for the technical quality of the perspective
but for the extent to which the subject has elaborated or beautified the
object. A table which shows attractive lines and proportions receives a higher
score than a graceless one. The test requires artistic skill, but it also reflects
creative effort. Scoring rules have been developed which permit clerks to
score the papers objectively; two scorings of the same papers correlate .94.

A validation study on students of architecture shows several interesting
facts. In Table 43 we see that the test predicts art courses moderately well


TABLE 43. Prediction of Success in Freshman Architecture Courses

                                   General                    Grade Average,
                                   Engineering    Freehand    Other Courses
Test                               Drawing        Drawing     Combined

Illinois Art Ability Test            .26            .42          .27
Object Aperture Test (spatial)       .57            .30          .27
Cooperative Mathematics Test         .40            .27          .45
ACE (general ability)                .40            .25          .45
Bennett TMC                          .60            .10          .09
Rank in high-school class                                        .49

Note: The criterion for the drawing courses is a composite of grades and ratings by instructors.
The number of cases varies from 27 to 69.

Source: These are unpublished data from a study by W. M. Gilbert, T. N. Ewing, D. R. Krathwohl,
and L. J. Cronbach.


but gives a poor prediction in engineering drawing. The latter is predicted
very well by the TMC and a spatial test. The average in other courses,
including English and mathematics, is best predicted by high-school marks, the
ACE tests, and a mathematics achievement test. This points to a fact of great
importance in vocational counseling. Even though a student possesses
special aptitudes in high degree, he cannot use them in a profession unless
his general ability is good enough to carry him through general college
courses. Specialized abilities, in fact, play little part in determining the
architect's overall freshman grades, where the drawing courses are outweighed
by nonspecialized courses. The TMC and the Art Ability Test correlate only
.12 with the overall grade average.


Analytic Tests

The job replica gives valuable information about students who have had
some art training, but it neither clarifies the nature of artistic talent nor gives
a basis for comparing untrained persons. For these purposes, it is necessary
to test components of artistic ability prior to training. These components
have not been adequately identified, and the tests now available are based
only on some investigator's hunch as to what makes an artist. One such test
is the Lewerenz Test of Fundamental Abilities in Visual Art (California
Test Bureau). Among the aspects of artistic ability measured are preference
for designs, drawing a sketch to fit a pattern, locating proper positions of



FIG. 58. Item from the Meier Test of Art Judgment. The arrangement of the woman's burden
has been changed. Which arrangement is better? (Item copyright 1940, Bureau of Educational
Research, State University of Iowa. Reproduced by permission.)


shadows, art vocabulary, reproducing a form (vase) from memory,
correction of perspective, and color matching. Practically no information on the
validity of the test has been published.

The most adequate analysis of art ability to date is that of Meier (1939),
who took into account biographies of artists and experimental test results. He
concluded that six traits distinguish the artist: fine eye and hand
coordination, energy and concentration, intelligence, keenness of observation,
creative imagination, and aesthetic judgment. While Meier planned tests for
“creative imagination” and “aesthetic perception,” only a test of artistic
judgment was actually completed.

The Meier Art Judgment Test (and an earlier form by Meier and C. E.
Seashore) has been more widely used than any other art test. The test
measures taste rather than ability to use art media. A “good” work of art is altered
in composition, shading, or some other quality so as to damage its aesthetic
appeal. The original and altered pictures are presented to the subject, who
selects the more pleasing one (Figure 58). If he agrees with experts who
have taken the test, he gets a high score. The needed validity studies on this
test have not been performed; a few studies give favorable but very
fragmentary evidence.

One difficulty with tests of judgment is that the subject may give, not his
own opinion, but his guess as to what experts will accept as best. An
experiment on the Graves Test of Design Judgment showed that by giving such
guesses the subject could gain several points over the score his own
preferences would earn (Buros, 1953, p. 336).

Artistic judgment is distinct from ability to perform artistically. Rose
Anderson (1951) warns against reliance on the test of judgment as a predictor
in the fine arts. Persons with poor Meier scores often are judged highly
promising by art instructors. Her counseling experience, she says,

has led to considerable caution in encouraging clients toward fine-art
specialization. On the other hand, the combined results of several tests
provide a more adequate basis for appraising potentialities for such
applied fields as advertising art, format, interior decoration, costume
design, and crafts. The appropriate combination of supporting aptitudes
includes superior artistic judgment reflected in a high McAdory score,
superior facility for spatial visualization and fine eye-hand coordination,
manual dexterity, evidence of drawing ability reflected in the Lewerenz
Originality of Line Drawing Test [subtest], in the Horn Art Aptitude
Test, or in work samples.

The McAdory test is a test of taste in furniture, clothing, and automobiles as
well as in fine art (Rose Anderson, 1948).

Research on artistic abilities is still in a most primitive stage. No systematic
research has been done using modern tests and adequate criteria. Most of
the tests have been left as they were when first designed as much as thirty
years ago, without follow-up research or revision. The nature of artistic
aptitude remains an unsolved (and neglected) problem.

12. What criterion would the Meier test be expected to predict better than the 
Horn test, and vice versa? 

13. What aspects of art aptitude do not appear to be measured by any of the tests 
described? 

14. Are the six traits listed by Meier most likely to reflect talent, training, or
temperament?

15. Assuming that the validities reported in Table 43 are confirmed by further
studies, what advice would a counselor give an applicant to the school of
architecture who scores at the 80th percentile in the Art Ability Test, TMC, and
Object Aperture Test but at the 30th percentile in mathematics and general
ability?


SOCIAL INTELLIGENCE


A recurrent interest of aptitude testers has been to identify qualities which
help one to get along with other people. It was once suggested that there
might be three categories of intelligence: abstract, practical, and social. We
have found some evidence permitting us to distinguish the first two, though
both of them must be further subdivided. After fifty years of intermittent
investigation, however, social intelligence remains undefined and unmeasured.

There are wide individual differences in performance of such roles as
salesman, Army officer, teacher, and psychotherapist. General intelligence
has something to do with success in most of these assignments, but it surely
is not the whole story. Perhaps the difference between the successful and the
unsuccessful performer depends chiefly on personality and interests. Many
testers have tried to identify an intellectual component in ability to respond
successfully to others, and these tests merit brief attention.

In 1926, a Test of Social Intelligence was developed by F. A. Moss and
others (published by George Washington University). There are four
subtests in the revised (1944) version: judgment in common-sense social
problems (e.g., “What is the best way to ask a favor of someone you know only
slightly?”), matching statements with the emotions expressed, everyday
psychological generalizations in true-false form (“In social relations, demands
are usually more effective than requests”), and completion of a joke
(multiple-choice). This test measures general or verbal ability to some degree, but
there is no evidence that it measures any distinct ability which has practical
predictive value. Enough attempts were made to establish the validity of the
test for selection of salesmen, etc., to indicate that this line of approach is
fruitless (R. L. Thorndike and Stein, 1937).

The last few years have seen a large number of tests of “social sensitivity,”
“insight into others,” or “empathy.” Personality theorists have argued that
good personal relations depend upon good communication, and that a good
leader or therapist is he who is sensitive to the ideas and feelings of others,
even the unvoiced ones. We measure A's understanding of B by asking A to
describe some aspect of B and comparing this judgment with an
independent criterion. The most common method, because it is the simplest to apply,
is to have B fill out a personality questionnaire describing himself, and to
have A fill out the questionnaire as he thinks B would. The method is
capable of endless variations, depending on whether A is asked to judge friends,
work associates, or strangers, on what opportunities he is given to observe the
strangers, and on what questions he has to answer.

This is a job replica which encounters all the pitfalls of an approach
resting on surface similarity rather than psychological understanding of the
variable measured. It is a reasonable assumption that a teacher or therapist or
leader ought to understand what those he works with are thinking. The test
design seems like a simple translation into scorable form of an act which he
performs every day. Careful analysis of these tests, however, shows that the
responses do not depend on insight into the individual (Cronbach, 1955b;
Gage and Cronbach, 1955). No evidence of validity is yet available which
warrants confidence in any present technique for measuring a person's
ability to judge others as individuals.
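
The scoring step in the questionnaire method can be made concrete with a short sketch. It assumes a true-false questionnaire and counts the items on which A's prediction matches B's self-description; the function name and the ten answers are hypothetical illustrations, not data from any of the studies cited. As the text cautions, a high count of this kind is not by itself evidence of insight into the individual.

```python
def agreement_score(self_report, prediction):
    """Count the items on which A's prediction matches B's own answer."""
    if len(self_report) != len(prediction):
        raise ValueError("both answer lists must cover the same items")
    return sum(1 for own, guess in zip(self_report, prediction) if own == guess)


# Hypothetical true/false answers to a ten-item questionnaire.
b_self_report = [True, False, True, True, False, False, True, False, True, True]
a_prediction = [True, False, False, True, False, True, True, False, True, False]

score = agreement_score(b_self_report, a_prediction)
print(f"A matched B's self-description on {score} of {len(b_self_report)} items")
```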

SPECIAL APTITUDE TESTS FOR COURSES AND PROFESSIONS 

About thirty years ago, numerous attempts were made to develop
specialized aptitude tests for particular school subjects or curricula such as algebra,
foreign language, engineering, or law. The test was usually prepared on the
basis of a superficial analysis of the course of study. Test problems were
based on the type of content to be encountered in the course (e.g., a foreign
language test might involve substituting nonsense symbols for words in a
sentence; a legal aptitude test would ordinarily present hypothetical
problems in legal reasoning).

The tests of this first period have virtually disappeared. The primary
reason is that the introduction of content specially relevant to the course of
study did not raise validity appreciably above that which could be obtained
with a good measure of general ability. When group tests began to provide
separate scores for verbal, quantitative, and later spatial and mechanical
comprehension, these broader-purpose tests appeared to offer all the
advantages of a special test for particular subjects. Prediction ordinarily can rest on
either a general mental test, a verbal test, or a general proficiency test,
although there may occasionally be an advantage in considering special
abilities also.

When thorough psychological study is made of a type of training which is
of widespread importance, it may be possible to discover component
abilities not covered in general ability tests. The best example is the Modern
Language Aptitude Test of Carroll and Sapon (Psychological Corporation).
This test, originally designed to select overseas employees to take special
intensive language courses, showed validities in the range .60-.75. When
used with high school students, predictive validities vary from school to
school and from course to course. For four languages in one high school,
validities ranged from .53 to .60; these validities are about .20 higher than
corresponding correlations for general mental ability. The test correlates .61
with general ability, implying that not all very bright pupils are superior at
language learning. The test is administered by means of a tape recording so
as to test the pupil's ear for strange sounds, his ability to learn new material,
and his sensitivity to grammatical forms (Carroll, 1959; Harding, 1958).

Many aptitude tests have been developed in recent years for use in
graduate and professional schools. The tests are for the most part measures of
general ability or academic achievement, such as might be used for general
college counseling. They are adjusted in difficulty to the level of students tested,
and place extra emphasis on the abilities of obvious interest to the profession
in question. This special emphasis may make the test a slightly better
predictor than a general-purpose test.


TABLE 44. Validation Research on the University of California Engineering
Examination

                                                                               Correlation
                                                                               with
                                                         Time                  Freshman
                                                         Limit                 Grade      Decision
Section                     Part                         (min.)  Reliability   Average    Regarding Test

I    General scholastic      1  Word meaning               15       .93          .11      Study further
     ability                 2  Verbal fluency             10       .63          --       Drop
                             3  Figure classification      30       .73          .12      Revise
                             4  Technical vocabulary       20       .95          .27      Use

II   Mathematical            5  Quantitative inference     70       .91          .39      Use
     reasoning               6  Numerical progression      20       .89          .11      Drop

III  Scientific              7  Understanding scientific   90       .93          .23      Use
                                relationships

IV   Spatial                 8  Figures                    10       .92          .06      Study further
     visualization           9  Cubes                      10       .90          .06      Study further
                            10  Length of time             10       .64          .04      Drop
                            11  Hidden forms               10       .61          .07      Drop
                            12  Line location              15       .95          .09      Study further
                            13  Matching parts             10       .79          .03      Drop

High-school grade average                                                        .39

Source: M. H. Jones and H. W. Case, 1955.


The development of a professional aptitude test is illustrated by a report
from the University of California regarding a test for the selection and
guidance of applicants to the Engineering School (M. H. Jones and H. W. Case,
1955). Four sections were developed: a general ability test, a mathematics
test, a test requiring interpretation of scientific data, and a spatial test. Each
section had several parts, adjusted in length according to their presumed
importance. Table 44 shows reliability estimates on limited samples, and
validities based on 533 engineering freshmen. Test 2 was dropped without
complete trial, because of low reliability.

The previous (high-school) grade average is the best predictor; this is a
common finding in academic prediction. Such a predictor is to some degree
unsatisfactory because grading standards vary in different schools, and test
scores are a valuable supplement. The best tests are 5 (quantitative
inference), 4 (technical vocabulary), and 7 (scientific interpretation). These
tests, combined with high-school average, correlate .51 with freshman marks.
Note that these tests are all measures of achievement. The prediction from
this three hours of testing would not be bettered by adding the tests with
lower validities, so tests 6, 10, 11, and 13 were dropped. Tests 1, 8, 9, and 12
and a revised form of 3 were retained because it was thought that this
information might be useful in predicting advanced courses.

The California data do not tell clearly whether the special battery does
better than a general predictor test would. For evidence on this point, we
turn to a report from the University of Utah (Pierson and Jex, 1951). The
Utah College of Engineering had used the Pre-Engineering Inventory, a
six-hour battery (which has subsequently been shortened to an eighty-minute
test) covering mathematics and scientific comprehension. It was found that
high-school marks predicted college engineering marks with an r of .57.
Adding subtests of the Pre-Engineering Inventory raised the correlation to about
.68. The nonspecialized Cooperative Achievement Tests, combined with
high-school average, correlated about .65 with engineering grades. (Just why
the correlation at Utah is so much higher than at California is uncertain. One
possibility is that the California engineering students are more severely
preselected.)
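
The size of the advantage can be judged by squaring each correlation, which gives the proportion of variance in engineering grades accounted for. The sketch below is only a worked illustration of that arithmetic, using the three Utah correlations quoted above (two of which the text gives as approximate); the dictionary labels and the function name are introduced here for convenience.

```python
# Correlations with engineering grades as quoted in the text for the Utah study.
# The .68 and .65 values are described there as approximate.
reported = {
    "high-school marks alone": 0.57,
    "marks + Pre-Engineering Inventory subtests": 0.68,
    "marks + Cooperative Achievement Tests": 0.65,
}


def variance_accounted_for(r):
    """Proportion of criterion variance associated with a correlation r."""
    return r * r


baseline = variance_accounted_for(reported["high-school marks alone"])
for label, r in reported.items():
    share = variance_accounted_for(r)
    print(f"{label}: r = {r:.2f}, r squared = {share:.3f}, "
          f"gain over marks alone = {share - baseline:.3f}")
```

On these figures the specialized battery and the general achievement tests add similar amounts to the prediction from marks alone, which is the comparison the paragraph turns on.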

This result is consistent with studies of other tests. A specialized
professional aptitude test is not appreciably more effective as a predictor than a
suitably weighted combination of ordinary measures of achievement and
general mental abilities. The only professions which constitute exceptions to
this rule are dentistry, where spatial and dexterity tests have predictive
value, and architecture, discussed above. The main reason for having
separate professional tests is administrative. Such tests are ordinarily distributed
through the national professional association or some comparable group and
are not available to individual counselors. This protects the secrecy of
questions so that tests can be used fairly as a basis for admission to professional
schools.

16. Why do measures of past achievement predict college marks better than
measures of general mental ability? In view of this fact, what function can group
mental tests perform in college admission and counseling?

17. How could a high school coach students desiring to enter Engineering School
at the University of California so as to raise their aptitude test scores? Is it
desirable to use tests which allow such coaching?


Suggested Readings 

Fleishman, Edwin A. Psychomotor selection tests: research and application in the
United States Air Force. Personnel Psychol., 1956, 9, 449-468.
This is a description of the Air Force psychomotor tests and a summary of
research on factors underlying them.

Schultz, Harold A. Review of the Meier Art Tests: I. Art Judgment. In O. K. Buros
(ed.), The fourth mental measurements yearbook. Highland Park, N.J.: Gryphon
Press, 1953. Pp. 338-340.
In a critique of the Meier test, largely from the point of view of content
validity of the items, Schultz indicates how complex, and how little understood, is
even this one aspect of artistic ability.

Traxler, Arthur E., & others. Validation of professional aptitude batteries.
Proceedings, 1950 Invitational Conference on Testing Problems. Princeton:
Educational Testing Service, 1951. Pp. 13-54.
A symposium describes efforts to develop special tests for accounting, law,
dentistry, and medicine.


Personnel Selection and Classification


WHEN we wish to predict success in task X, it would be convenient if we
could look in a test catalog, find a test labeled “Test of Aptitude for Task X,”
and begin using that test for selection. Unfortunately, the procedure
required to establish a selection program is much more complicated. One
difficulty is that tests with similar names measure different things, and that a test
intended to predict a particular performance may be a poorer predictor than
a test made for quite another purpose. More than one test may be required
to cover all the aptitudes a particular job demands. Another problem is that
jobs are difficult to classify. Some mechanical jobs seem to make psychomotor
demands which almost anyone can satisfy, whereas success in other jobs with
similar titles depends almost entirely upon the psychomotor factors. No
matter how well a test has been developed and how thoroughly its author has
validated it, no one knows how well it will predict in a particular practical
situation until it is tried out there.

The employment manager or educational admissions officer can accept no
test on face value, nor can he accept a test solely on the basis of research
conducted elsewhere. Sooner or later, nearly every test worker must carry out
his own validation studies to determine whether his prediction methods are
working. While the practicing tester may limit his studies to relatively simple
follow-up, it is important to know the full procedure for validation research,
since this establishes the basic logic of any study of prediction.

In Chapter 2 we distinguished among various types of decisions for which
tests are used: selection, classification, evaluation of treatments, and
verification of scientific hypotheses. While in one sense prediction is involved in
using tests for any of these purposes, the empirical, criterion-oriented
validation procedures to be examined in this chapter are most directly relevant to
selection and classification. We shall devote the greater part of our
discussion to selection, centering on employee selection for the sake of clarity. The
statements are equally relevant, however, to institutional selection (or
rejection) decisions in military, educational, and clinical settings.

PROCEDURES IN PREDICTION RESEARCH

To predict success in a job one chooses a number of tests for tryout,
determines their effectiveness experimentally, and devises a plan for using test
scores in making decisions. One procedure relies on crude trial and error:
the experimenter assembles a “shotgun” battery of all kinds of tests in the
hope that one or more of them will prove effective. This method is declining
as we understand better why some tests are valid and others are not.
Psychologists developing test batteries today devote considerable thought to the
characteristics of the job and the establishment of adequate criteria, as well
as to the search for promising tests.

The stages in prediction research are as follows:

Job analysis, to determine what characteristics appear to make for
success or failure.

Choice of possibly useful tests to measure these characteristics.

Administration of tests to an experimental group of workers.

Collection of criterion data showing how the experimental group of
workers succeeded on the job.

Analysis of the relation between test score and success on the job, and
installation of most effective selection plan.
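
The final stage, analysis of the relation between test score and success, is most often summarized by a correlation between test and criterion. The following is a minimal sketch of that computation, assuming a single test and a single numerical criterion for a small tryout group; the function name, the eight test scores, and the eight criterion ratings are hypothetical and serve only to show the arithmetic.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between paired lists x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 cases")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical tryout group: test scores at hiring, supervisor ratings later.
test_scores = [12, 15, 9, 20, 17, 11, 14, 18]
criterion_ratings = [3, 4, 2, 5, 4, 2, 3, 5]

validity = pearson_r(test_scores, criterion_ratings)
print(f"validity coefficient for the tryout group: {validity:.2f}")
```

A coefficient obtained this way describes only the group on which it was computed; the chapter's point that a test must be tried out in the particular situation applies to the sketch as well.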

Job Analysis 

The first step is to analyze the job to be predicted. This analysis sets up
hypotheses stating which abilities and habits contribute to or limit success in
the job. No machine-like procedure of checking off one by one all possible
factors has ever been found successful. Instead, the psychologist studies the
task with whatever insight and psychological knowledge he can muster. Job
analysis is in large measure an art.

In order to make a successful analysis, one must first of all have wide
background in psychology. Understanding of motivation, motor habits and the
organization of abilities, and knowledge of the multitude of tests now
available are required. Detailed motion analysis will suggest what dexterities or
coordinations are important. Analysis of the stimuli to which the worker
responds may suggest need for certain perceptual or sensory abilities. One
frequent approach is to compare good and poor employees. Simple studies
often reveal essential differences between good and bad performers. Study
of workers in training is helpful, since their difficulties in learning may show
what aptitude is needed to avoid failure. Research on prediction for other
jobs draws attention to tests worth trying and sometimes suggests that
certain tests can be eliminated without further trial. No routine or stereotyped
approach is likely to be successful, however. The analyst must take off from
the experience of others, but unless he brings in new hypotheses he is
unlikely to find a better method of predicting.

A job analysis should be highly specific. One should not state that
successful workers have “mechanical ability”; one should instead define the ability
as “knowledge of and ability to apply principles of gears,” or “speed in
routine two-handed manipulation, not involving much finger dexterity or
thinking.” Such clear definitions permit one to obtain or construct the most
appropriate test. The analysis should not be confined to “aptitudes.” It should
range over the entire field of abilities, habits, personality characteristics and
interests, previous experiences, knowledge, physique, and so on.

Impressionistic job analysis places heavy reliance on the opinions of a few
observers. A more systematic procedure for collecting opinions from a larger
body of informants and reducing the effect of folklore on the job analysis is
the “critical-incident” technique developed by Flanagan (1954). The analyst
asks a foreman or some other person well acquainted with the job to think of
an individual who has done excellently on the job, and then to recall one
particular incident which showed this person's superiority. Likewise, the
informant recalls a poor performer, perhaps one who had to be discharged, and the
incident which led to the final verdict of unsuitability. These incidents are
concrete, and only one stage removed from field observation of good and
poor performance, as can be seen in these two examples (Preston, 1948):

This officer was instructed to land his P-80 on runway 15. He pedaled on the
right runway but lined up to land on runway 9. He was told to go around and
line up and land on runway 15 again. This time he overshot and had to go around.
He was getting dangerously low on fuel so I personally talked him around the
pattern, putting him on his down-wind leg, and instructed him when to turn on
base. I asked him if he had runway 15 spotted and he said “Roger.” After
acknowledging, he flew right by runway 15 and almost “spun-in” trying to turn in
on runway 9. Being low on fuel, I told him to go ahead and land. He came in hot
and ran off the end of the runway.

In meeting and acting as a pilot for general officers this lieutenant has brought
favorable comment upon himself through the accomplishment of the mission. One
specific case was when, through no fault of his own, an aircraft was allowed to
depart without a retired Major General on board. Immediately upon being
confronted by the general (a rather crusty old bird) he, without calling on me or
any other superior, arranged for his departure to the original destination in time
to overtake his original aircraft.

The incidents are classified into logical categories in order to identify
variables that may be measured for the purposes of prediction.
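
Once incidents have been sorted, the bookkeeping is a simple tally of how often each category appears among effective and ineffective incidents. The sketch below assumes each incident has already been assigned a category by the analyst; the category labels and the five incidents are hypothetical, and the classification itself remains a matter of judgment that no tally can supply.

```python
from collections import Counter

# Hypothetical classified incidents: (category, kind of performance).
incidents = [
    ("following landing instructions", "ineffective"),
    ("initiative without supervision", "effective"),
    ("following landing instructions", "ineffective"),
    ("initiative without supervision", "effective"),
    ("following landing instructions", "effective"),
]

# Count how many incidents fall in each (category, kind) cell.
tally = Counter(incidents)
for (category, kind), count in sorted(tally.items()):
    print(f"{category:32s} {kind:12s} {count}")
```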

The critical-incident method collects richly suggestive data, avoiding
vague generalities such as “This job requires good judgment.” It is not,
however, a truly objective method. If the folklore of the business says that truck
drivers must have stamina but not necessarily much intelligence, the
informant is likely to bring to mind incidents which support the stamina theory,
and to forget the cases in which drivers made themselves valuable by
recognizing mixups in their orders. The person who classifies incidents likewise
can introduce stereotypes into the final result, but this disadvantage is
present in any judgment of job requirements.

1. Prepare a list of the factors composing aptitude for one of the following jobs:
making pie dough, operating a calculator of a particular type, driving a taxi, or
schoolteaching of some one type.

2. For many jobs requiring long training, e.g., physiotherapy, it is undesirable to
take girls into training who will probably marry and drop out. What
characteristics might distinguish between probable marriers and nonmarriers?

Choice of Tests for Tryout 

Having a list of characteristics presumed to be important in a job, the
investigator must then find tests to measure each. He must make a choice
between seeking one test which is a composite of the job requirements and
seeking a group of tests, each of which is a pure and independent measure of
one of the characteristics. The former method, which usually leads to tests
of the job-replica type, requires the investigator to design a new test for the
job. As we have seen in considering mechanical, psychomotor, artistic, and
academic aptitudes, the relatively complex test which comes close to the
requirements of the job generally gives higher validity coefficients than simple
tests, although simple tests may be equally useful when a number of them
can be combined. The specially designed job replica has distinct
disadvantages:

• In employee placement and in guidance it is more economical to use a
few tests which give information about many jobs than to use a separate test
for each job.

• Work samples must be revised, restandardized, and revalidated when
any change in the nature of the job is made. A battery of simple tests can be
revised to fit minor changes in the job by altering weights assigned to the
tests or by adding perhaps one more test.

Assuming that the investigator decides to use many tests, each for a
particular function, he must then choose among available tests or construct new
ones. If the abilities the job seems to demand are already measured in
published tests, such tests should be tried. Naturally, not every test with a
relevant name will be suitable; the investigator must consider the difficulty of
the test, its appropriateness to the intelligence and education of his subjects,
and the like. If the job calls for an ability only approximately represented in
available aptitude tests, it is more desirable to make a new test to measure
this ability than to obtain a pale distorted image of it from an indirect
measure. Without condemning the useful TMC, we can use it to illustrate this
point. The items measure some general factor, but there are also group and
unique factors among its items, which are drawn from all portions of physics
and mechanics. To select men for advanced electrical training, background
and comprehension are significant, but it is probable that Bennett’s items
dealing with electricity will give better correlations than his items on forces,
motion, and buoyancy. Inclusion of the latter items might, in fact, dilute
the test so that it will fail to select good workers, whereas a test covering only
electrical comprehension might be a good predictor. Unless there is a close
psychological correspondence between an available test and the job, a new
test of the ability must be constructed.

There is no simple answer to the question of published versus homemade
tests. Tests validated on many jobs are of distinct advantage in educational
guidance, and to a lesser extent in employment work. Counselors would
prefer to predict all jobs with a few tests. But the test designed for a specific job
often has significantly higher validity than the test for general use.

In addition to tests designed to yield predetermined scores, many
selection studies employ tests which are no more than collections of
heterogeneous items. The most common of these is the biographical inventory, used
to identify background factors capable of predicting success. There is no
published biographical inventory of this type, but it is a simple matter to
prepare a suitable collection of items covering work and educational experience,
hobbies, athletic background, social activities, and family. In this
miscellaneous collection, each item is treated as a separate test, and its relation to
success on the job is examined empirically. Each job will correlate with some of
the items, and a score based on just those items can be used to predict
success in that job. The same technique of trying out a mixed assemblage of
items and keying only those which predict the criterion can be applied to
interest and personality tests. Inventories scored in this manner have often
predicted job success as well as ability tests. The wartime Air Force
Biographical Data Blank correlated .30 with pilot success (the items scored being
those which distinguished successes from failures in the first group studied).
Among eighty different types of tests studied, the only ones yielding
coefficients higher than .30 were a pencil-paper test of knowledge about
automobile driving (.32), instrument comprehension (.32), and mechanical
principles (ca. .35) (Guilford, 1947). Two of these three tests clearly depend on
background experience.
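
The item-keying procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the Air Force scoring method: the response matrix, the success labels, and the threshold of .25 are hypothetical, and the index used (the difference in endorsement rate between successes and failures) is one of several that could serve.

```python
# Hypothetical data: each row is one person's yes/no (1/0) answers to four
# biographical items; `success` marks who later succeeded on the job.
responses = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 1],
]
success = [1, 1, 1, 0, 0, 0]


def item_validity(responses, success):
    """For each item, endorsement rate among successes minus rate among failures."""
    wins = [r for r, s in zip(responses, success) if s]
    losses = [r for r, s in zip(responses, success) if not s]
    diffs = []
    for j in range(len(responses[0])):
        p_win = sum(r[j] for r in wins) / len(wins)
        p_loss = sum(r[j] for r in losses) / len(losses)
        diffs.append(p_win - p_loss)
    return diffs


def build_key(diffs, threshold=0.25):
    """Keep only the items that separate successes from failures (assumed cutoff)."""
    return [j for j, d in enumerate(diffs) if d >= threshold]


def score(response, key):
    """Score a person on the keyed items only."""
    return sum(response[j] for j in key)


key = build_key(item_validity(responses, success))
scores = [score(r, key) for r in responses]
```

A key built this way capitalizes on chance differences in the first group, so it would ordinarily be checked on a fresh group before being put to use.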

3. What practical conditions would a department store consider in deciding
whether to make a special test for each type of clerk or to use a published test
for salespeople in general?

Experimental Trial 

The crucial step in prediction research is experimental trial of the
instruments. One gives the tests to typical applicants and observes the
correspondence of test scores to success. In practical work there is much pressure to
omit the experimental study; this pressure must be resisted. When the
psychologist reports to his boss that he believes test X will eliminate poorer
employees, the boss is far more anxious to install the test and benefit from it at
once than to withhold judgment during weeks or even years of investigation.
Full experimental trial is indispensable. No hypothesis can be trusted,
because there have been many instances in which “likely” tests proved to be of
no value in selection.

The nonpsychologist may propose to use the test to eliminate poor men,
and to study the survivors to determine the relation between test and
performance. This is not a satisfactory plan. A test might not predict which of
the acceptable men would do well on the job even though it could weed out
failures. (Example: A hearing test would rule out some people as music
students, but within a selected group, all of whom could hear, it would not
predict success.) Trial on an unselected group is necessary, moreover, to
establish critical scores and weightings of tests. So important is experimental trial
on an unselected population that the Air Force went to the trouble to
validate its selection methods by sending through training 1300 men, a random
sample of all eligible recruits, even though it knew in advance that the
majority of these men would be failures (DuBois, 1947).
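
The point about studying only the survivors can be illustrated with a small calculation. The scores below are invented for the purpose: the test separates the low performers from the rest, yet within the group that would have passed a cutoff of 5 it carries no useful information. The cutoff and all the numbers are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical unselected group: test score and later job performance.
test_score = [1, 2, 3, 4, 5, 6, 7, 8]
performance = [1, 2, 1, 2, 6, 5, 6, 5]

# Validity in the full, unselected group.
r_all = pearson(test_score, performance)

# Validity among those who would have survived a cutoff of 5 (assumed).
survivors = [(t, p) for t, p in zip(test_score, performance) if t >= 5]
r_survivors = pearson([t for t, _ in survivors], [p for _, p in survivors])
```

With these figures the correlation is about .85 in the whole group and slightly negative among the survivors, so a study confined to survivors would wrongly suggest that the test is worthless.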

Subjects should take the tests with the same motivation that would exist in
their ultimate use (see p. 53). The investigator will try more tests than he
can use in his final prediction battery, since some will probably not be
helpful. This makes the trial battery long, and special attention must be paid to
maintaining cooperation from the subjects. Sometimes one test can be tried
at a time, but sooner or later the entire selection battery must be validated
on a single group.

4. Suppose an employer puts a test in use without tryout. What harm can result
from this, assuming that the validity of the test is zero or low positive?


The Criterion 

After giving his tests, the experimenter waits for evidence of good and
poor job performance. The experimental group is treated in the same way
as other workers, being given normal training and duties. After a suitable
interval, data on success are obtained. Among the criteria often used are
quantity of production, quality of production, turnover, and opinions of foremen
or supervisors. As was explained on p. 108, it is important that the criterion
possess a high degree of validity. A test which can predict quality of work
will seem to be a poor test if it is judged by a criterion which does not fairly
indicate quality of work. The criterion (or set of criteria) should cover all
important aspects of the job.

Criteria may be based on measured output, field observations, or ratings.
The criterion must have high reliability. An adequate number of
observations representative of normal performance is required. If a proficiency test
is the criterion, it must meet the usual requirements of objectivity, stability,
and validity. Ratings are particularly common criteria and are subject to
many errors. Methods of making ratings more dependable are discussed in
a later chapter (pp. 506 ff.).

Frequently no single measure of success is suitable. One reason is that
different workers seem best at different times. The fast learner who does well
at the end of a short training course may not give as good ultimate results as
a learner who continues to gain in ability after he starts work. The worker
who makes good grades in training sometimes lacks temperamental qualities
for success on the job.

In every study, there is some hypothetical “ultimate criterion” which best
represents what the selector desires to obtain. The medical school would, if
it could, judge the success of its students by their lifetime contribution to the
community where they practice. This probably depends more on personality
attributes than on abilities; it certainly is not very closely related to grades in
biochemistry. The student’s grades, however, are likely to be the criterion in
any selection research done by the medical school; they are available, and it
is certainly true that the student who never graduates will make no medical
contribution. The most extensive effort to study an ultimate criterion is the
work on combat effectiveness in Korea. Teams of observers and interviewers
went to the theater of combat to obtain information on performance; these
data were supplemented by ratings from field commanders. The validity of
an Army test battery developed to predict performance in training and in
maneuvers was .27 against these peacetime criteria, but only .17 against a
combat criterion. A battery developed using the combat criterion correlated
.36 with both training and combat criteria. The important difference
between the two batteries was the inclusion of a personality questionnaire in
the combat-valid battery (Willemin et al., 1958).

More and more attention is being given to establishing plural criteria for
success in the same job. This is particularly important for high-level jobs;
there are a great many patterns of success among officers, executives,
consulting engineers, or artists. Teachers, for example, may be successful in
different ways: one may develop into a friend and counselor for youth, one may
stimulate independent and courageous thinking in the few brightest pupils,
another may overcome the blockings that cause failure among poor students.
No one of these teachers is best, but all are necessary types. It is impossible
to find a single criterion that is adequate for comparing these different types
of teaching success.

A striking study by Lennon and Baxter (1945) inverts the normal
procedure and determines what aspects of the criterion can be predicted by
available tests. Each item on a ninety-item checklist applied to clerical workers
was correlated against a revision of Army Alpha and an aptitude test
requiring alphabetizing, number checking, coding, digit counting, computation,
and reading of tables. For some aspects of job performance, correlations with
the predictors were high, but other qualities were not predicted. Checklist
items were well predicted which dealt with understanding of the work,
quantity and speed of work, performance of multiple tasks, and avoidance
of duplication of effort. Faults in quality of work, typing, shorthand,
grammar, orderliness, and “personality” were not predicted successfully. Some of
the results are shown in Table 45. This study shows why it is difficult to
predict such a composite criterion as “supervisor’s rating of all-round
performance.”


TABLE 45. Percentage of Office Workers Having High and Low Aptitude Scores
Rated as Having Particular Characteristics

                                       Learning Ability Test    Clerical Aptitude Test
                                       High 27%    Low 27%      High 27%    Low 27%
Characteristic                         (N = 58)    (N = 58)     (N = 58)    (N = 58)

His working instructions have to be
  repeated frequently                      7          12            5           5
Has made helpful suggestions about
  work handled                            31          29           38          22
Often does necessary but unrequested
  work on his own initiative              37          26           39          25
Checks his work for errors before
  releasing it                            51          45           52          48
Sometimes forgets matters which
  should receive prompt attention          5           7            7           7
Is inclined to sacrifice accuracy
  for speed                                4           3            6           5

Boldface type indicates that the difference between low and high group is
probably a true difference rather than the result of chance in sampling.
Source: Lennon and Baxter, 1945.


When records have been collected to show which workers are most
successful, the final procedure in a selection study is to process the data and
identify the best predictors. Before discussing the analysis of prediction data,
it will be desirable to see the entire research process by examining an actual
study.

5. What procedure might be suggested for selecting clerical workers, in view of the
findings of Lennon and Baxter?

6. List several independent (nonduplicating) criteria which might be used to
evaluate teacher success.

7. List several independent criteria to consider in judging branch managers of an
equipment firm. Branches are responsible for both sales and service.

8. McNemar (1952) makes the following comment about a study of performance in
clinical psychology: “It is sheer nonsense to have proceeded with an extensive
testing and assessment prediction program without first having devised
satisfactory measures of that which was to be predicted.” Yet this study was conducted
by well-qualified and experienced persons and supported by a large
appropriation from equally responsible psychologists in the Veterans Administration. What
arguments can be given on each side of this controversy?


Development of a Stenographic Aptitude Test 

Among girls who undertake the study of shorthand, quite a few fail. A test
which could be given before training would save time and money spent in
stenography courses and permit girls to select a more appropriate vocation.
Moreover, tests of general intelligence and the commonly measured abilities
have had only moderate success in predicting failures in shorthand. Deemer
therefore decided to develop a test geared specifically to the problem of
predicting success in learning stenography. The abilities to be built into the test
would be those important in stenography, even if they were of no
significance anywhere else in business or education.

A job analysis was made. It was based upon a study of the shorthand
systems and the nature of the job, rather than upon observation of
stenographers. The resulting list of abilities was as follows (Deemer, 1944):

During dictation. The more efficient stenographer will probably be superior to
the less efficient in:

1. Ability to listen to what is being said during dictation, i.e., facility with which
she attaches meaning to each word dictated.

2. Ability to write correct outlines fluently and rapidly.

3. Ability to hold a number of words in mind while writing others.

4. Ability to be “behind the dictator” without becoming flurried.

5. Knowledge of symbols for complete words. The less efficient stenographer will
have to compose more outlines sound by sound during dictation.

6. Thoroughness in checking, during pauses in dictation, the outlines just written.

During transcription. The more efficient stenographer will probably be superior
to the less efficient in:

1. Ability to judge from the length of her notes where to begin the letter on the
page.

2. Ability to produce letters which are neat and clean.

3. Ability to read the outlines she has written:

a. To call up the word or words for which an outline stands, either by
recognizing the outline as a whole or by deciphering the outline sound by sound.
b. To choose, when necessary, the word that fits the context.

4. Ability to spell the words.

5. Ability to type the words accurately and rapidly.

6. Ability to judge how far ahead to read before beginning to type.

This list of abilities was shortened by eliminating aptitudes which all girls
might be expected to possess to an adequate degree, by eliminating those
abilities which would be developed in training, and by combining some
abilities. Preliminary forms of the test were then designed, using the following
exercises.¹

Speed of writing. Girls were required to copy the Gettysburg Address in longhand
as rapidly as possible. This may be considered a coordination or rate of
movement test, duplicating many motor elements of shorthand writing.

Word discrimination. In this test, girls choose which of two words best fits a
particular context, as in “We are satisfied that our (personal personnel) is
completely loyal to the firm.” This simulates the problem in shorthand of choosing
the correct word when an outline fits more than one word. This is a complex
intellectual action involving verbal intelligence, vocabulary, and spelling.

Phonetic spelling. Girls are to write, correctly spelled, the words represented
phonetically by ashen, akshn, vejtabl, bleef, etc. This simulates the problem in
shorthand of recalling the entire word from a phonetic symbol, and at the same time
tests spelling.

Vocabulary.

Sentence dictation. The tester reads aloud sentences of varying length which the
subjects take down. The sentences increase in length, so that the subject must
eventually carry many words in mind.

The next stage in the study was to try the preliminary forms of the test.
Some items proved ambiguous, too easy, or too hard and were removed. The
final form of the test for validation was then prepared. Validity was
determined by administering the test to 500 students entering shorthand classes.
During the next two years, various measures of achievement were collected.
For the total test score, the validity coefficients were as follows, for different
criteria:


Accuracy of transcription after one year of study, dictation at
  60 w.p.m. or less                                                    .54
Accuracy of transcription after two years of study, dictation at
  80 w.p.m. or less                                                    .65
Accuracy of transcription after two years of study, dictation at
  more than 80 w.p.m.                                                  .70
Accuracy of transcription after two years of study, dictation of
  material the shorthand outlines for which had not been studied
  (80 w.p.m.)                                                          .58
Accuracy of transcription after two years of study, shorthand notes
  being transcribed two weeks after dictation (90 w.p.m. or less)      .65
Rate of transcription at end of two years of study                     .35


These validity coefficients are high enough to justify using the test to identify
girls likely to have difficulty in the course. Since a coefficient of .65 means
that many false predictions will be made in individual cases, a school will
usually prefer to use the test to point out those who should have special
attention from the teacher rather than arbitrarily to bar girls with low scores
from trying shorthand.

9. What abilities listed in the job analysis are not represented in the final test?

10. Deemer’s manual says, “No reliability coefficients are reported for this test
because it is felt that they add nothing to the reported validity coefficients. If the
validity coefficient is satisfactory, the reliability coefficient must be satisfactory.”
Although the latter sentence is defensible, one sometimes wishes reliability
data also. What important questions about Deemer’s test would be answered
by reliability coefficients for subtests and total score?

11. What explanations can be offered for the failure of the validity coefficient to
reach 1.00? What does this imply regarding ways to improve the prediction in
further studies?

12. The DAT battery, developed later than Deemer’s test, includes measures of
spelling and verbal ability. In a sample of 43 girls, the validity coefficients for
predicting shorthand grades are VR, .45; Spelling, .63; Sentences, .58. Does
Deemer’s test appear worth using when DAT results are available? If neither
test has been given, which should a school adopt to reduce the number of
failures in shorthand?

¹ Items copyright 1944 by Walter L. Deemer and reproduced by permission of Science
Research Associates.


DRAWING CONCLUSIONS FROM SELECTION TESTS 
Strategy of Decision Making 

The test scores, once obtained, are translated into decisions according to
some plan. The plan or strategy describes how scores from various tests are
to be combined, how they are to be combined with nontest information, and
what decision will be made for any given combination of facts.

For the moment, we shall consider only decisions based on a single test
score. Where there is a definite number of vacancies to be filled, the obvious
strategy is to rank individuals and fill vacancies from the top of the list. If
there is no limit on the number of persons to be selected, the strategy takes
the form of a “cutting score.” All persons below this score are rejected.

The cutting score is determined from the scatter diagram or expectancy
table. The validation data indicate what degree of success is to be expected
from persons in each score level. The decision maker decides what level of
risk he is willing to accept and fixes the critical score accordingly. The
expectancy table in Figure 59 shows how engineering marks at the University
of Idaho in a certain year correspond to ACE scores. A grade average below
2.0 is regarded as unsatisfactory. The investigator, after examining the
successive columns of the table, set 85 as his critical score. Any person below
that score was discouraged from entering the School of Engineering.

FIG. 59. Scatter diagram for the ACE test as a predictor of engineering
grades (Sessions, 1955).

A standard terminology borrowed from medical diagnosis is applied to
decisions involving two definite categories, such as “succeed” and “fail” or
“brain damaged” and “not brain damaged.” A person who shows the “bad”
sign on a test is said to be a “positive.” This terminology is in some ways
confusing, but the medical slang is, of course, a short cut for “cases where the
test gives a positive indication of the disease.” A follow-up study divides
positives into two groups: “hits” or true positives, who do turn out to have the
weaknesses indicated, and “false positives,” who turn out not to belong in
the “bad” category where the test placed them. Among those who are cleared
by the test (the “negatives”), the persons who should have been identified
as positive are called “misses.” No special name is given to the true negatives,
who are ordinarily much the largest fraction of the sample. In the
engineering study, there were 16 positives out of 147. Of the 16 positives, 14 were hits
and 2 were false positives. The test let 32 misses slip into the engineering
school.
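
The bookkeeping behind these terms can be laid out explicitly. The sketch below uses only the counts quoted in the text (147 cases, 16 positives, 14 hits, 32 misses); the variable names and the two summary ratios are our own additions for illustration.

```python
# Counts reported for the engineering study.
total = 147
positives = 16          # persons scoring below the cutting score
hits = 14               # positives who did in fact do unsatisfactory work
misses = 32             # persons cleared by the test who nevertheless failed

# Quantities that follow by subtraction.
false_positives = positives - hits        # flagged, but would have succeeded
negatives = total - positives             # persons cleared by the test
true_negatives = negatives - misses       # cleared, and did succeed
actual_failures = hits + misses           # everyone who failed, flagged or not

# Two ways of summarizing the same table (labels are ours, not the text's).
share_of_failures_caught = hits / actual_failures   # 14 of 46
share_of_positives_correct = hits / positives       # 14 of 16
```

On these figures the cutting score was right about most of the people it flagged, but it caught fewer than a third of those who eventually failed.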

It is generally unwise to set a cutting score directly from the raw data in
the scatter diagram. In the engineering data, it looks as if there is a marked
difference between persons scoring 85-89 and those scoring 80-84.
Three-fourths of one group pass, whereas none of the others pass. This abrupt
decline is almost certainly due to the fact that only a limited sample was
studied. With more cases, there would be more misses in the 85-89 column and
some false positives in the 80-84 column. Figure 60 permits us to estimate
what will happen in a large population. The dots in that figure are the
proportions of failure in each five-point interval. The line fitted to the points
gives an estimate of the trend of failure in the population of which these 131
cases are a sample, i.e., of the trend to be expected in other samples.
Estimating from this line, the failure rate at 85 (the cutting score originally
proposed) is about 62 percent.
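
Smoothing of this kind can be sketched with an ordinary least-squares line. The text does not say how the line in Figure 60 was fitted, and the proportions below are hypothetical except for the two intervals mentioned above (none passing at 80-84, three-fourths passing at 85-89), so the output should not be read as a reconstruction of the figure.

```python
# Interval midpoints and observed proportion failing in each interval.
# Only the values for 82 and 87 come from the text; the rest are invented.
midpoints = [77, 82, 87, 92, 97, 102, 107]
prop_fail = [0.80, 1.00, 0.25, 0.50, 0.40, 0.30, 0.25]


def fit_line(x, y):
    """Unweighted least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(midpoints, prop_fail)


def estimated_failure_rate(score):
    """Failure rate read from the fitted line instead of the raw column."""
    return intercept + slope * score


def score_for_failure_rate(rate):
    """Score at which the fitted line reaches a given failure rate."""
    return (rate - intercept) / slope
```

With these invented points the fitted line puts the failure rate near .71 at a score of 82 and near .60 at 87, in place of the raw 1.00 and .25. A fuller treatment would weight each interval by its number of cases.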

Setting a cutting score requires a value judgment. If we accept a cutting
score of 85, it means we wish to reject persons who have less than an even
chance of passing. One college administration might decide that it could not
afford to admit boys unless they have a 70 percent chance for survival; if so,
the cutting score should be 105. Another administration might leave the
choice to the student except where there is a very high probability of failure.
For this purpose a cutting score of 75 might be set to rule out those boys who
have a 4-to-1 chance of failing. The administrator who lowers the cutting
score reduces his false positive rate; that is, he runs less risk of cutting out a
satisfactory student. At the same time, however, he increases his number of
misses. The choice of cutting score cannot be made scientifically. It is a
decision based on personal, social, and economic values, combined with practical
considerations.

FIG. 60. Probability of success in engineering as a function of test score.


Some of the arguments which lead one to shift the cutting score downward
(accepting more students who will fail) are these:

• A “failure” is not a total loss. The student will gain a good deal from a
year of college, even if he then drops out. If admitted he will become of
greater value to society because of whatever he learns.

• If the boy is refused admission, he may be a total loss to higher
education. If he is enrolled, further investigation can perhaps identify deficiencies
to be removed or help him work out a plan in which he has a greater chance
of success.

• When the country needs engineers very badly, it is important to
process even low-grade ore to get a few students who will graduate.

• Tests are fallible. A decision to admit the boy is really a decision to
continue testing him by means of his class performance. There is no way to
continue testing a boy who is rejected. Erroneous decisions to reject cannot
be corrected.

On the other side, the arguments for a high critical score include these:

• Accepting a boy who is unlikely to succeed wastes educational
resources. He takes staff time which might better be spent on more promising
students. His presence in the group lowers the level of discussion and thus
robs the better students.

• The boy who is going to fail is better off facing the fact at once, rather
than after he has wasted a year. He can use the year to get started in a more
suitable trade or course of study.

In general, the problem is to weigh the loss from accepting a failure
against the loss from rejecting a person who will make good. The proper
cutting score is one at which these two risks are in balance.
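
One way to make this balance concrete, offered only as a sketch and not as the author's procedure, is to put a number on each loss and find the failure probability at which the two expected losses are equal. The expectancy table and the loss values below are hypothetical.

```python
def break_even_failure_probability(loss_accept_failure, loss_reject_success):
    """Failure probability at which accepting and rejecting cost the same.

    Accepting costs p * loss_accept_failure on the average; rejecting costs
    (1 - p) * loss_reject_success. Setting them equal and solving for p gives
    the ratio returned here.
    """
    return loss_reject_success / (loss_accept_failure + loss_reject_success)


# Hypothetical expectancy table: score level -> estimated chance of failing.
expectancy = {70: 0.85, 80: 0.70, 90: 0.55, 100: 0.40, 110: 0.25}


def cutting_score(expectancy, loss_accept_failure, loss_reject_success):
    """Lowest score level whose failure risk does not exceed the break-even point."""
    limit = break_even_failure_probability(loss_accept_failure, loss_reject_success)
    acceptable = [score for score, p in expectancy.items() if p <= limit]
    return min(acceptable)


# Equal losses: reject anyone with less than an even chance of passing.
equal_losses = cutting_score(expectancy, 1, 1)

# Rejecting a future success judged four times as costly as admitting a
# failure: reject only when the odds of failing exceed 4 to 1.
lenient = cutting_score(expectancy, 1, 4)
```

The arithmetic does not remove the value judgment. It only moves it into the choice of the two loss figures, which is where the arguments listed above enter.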

The examples discussed above assume that performance increases as test
score increases. When turnover is used as a criterion, a different type of
relation is at times encountered. Turnover sometimes is found to be relatively
great for men with very high and very low aptitude, whereas men in the
middle range tend to stay on the job. In a study of taxi drivers, seven out of ten
tests of discrimination, motor speed, and reasoning showed such a relation
to the criterion. Data for two of the tests are plotted in Figure 61. It is
reasonable to suppose that the poorest men drop out because of difficulty on the
job, while the best men are able to move to some more satisfying or
better-paid job. For a situation like this, a double cutoff plan might be necessary to
eliminate men at either extreme of the ability scale.

FIG. 61. Curvilinear relations between predictors and turnover of taxi drivers
(C. W. Brown and E. E. Ghiselli, 1953).

Expectancies and critical scores may change rapidly in a period when
institutions are changing. One striking example of such change is the report of
grades of students entering medical school summarized in Table 46. The 1955
admissions are strikingly poorer than those in 1950. The schools evidently
were able to hold the same critical score for admission, but they attracted far
fewer “A” applicants. For the college counselor and his client considering
medical school, the change is of great importance. An undergraduate with a
B+ average would have been only an average medical student in 1950, but
in 1955 he would have been near the 75th percentile of his class. If the trend
(at present unexplained) continues, medical schools will find it necessary to
reduce their demands upon students, and counselors will encourage students
to enter medicine who would formerly have been regarded as unpromising
candidates.

TABLE 46. Caliber of First-Year Medical Students in Successive Years

                 Percentage Having an Undergraduate
                         Grade Average of
Year                 A           B           C

1950-1951           40          43          17
1951-1952           30          55          15
1952-1953           18          68          14
1953-1954           21          69          10
1954-1955           17          69          14
1955-1956           16          71          14

Source: Anon., 1956.

13. In Figures 59 and 60, what cutting score would be used to eliminate students
with one chance in three of failing?

14. What assumptions are made if the cutting score of 85 on the ACE, proposed
for the University of Idaho, is applied in other engineering schools?

15. A screening test is applied to school children to identify those in poor mental
health so that they can be given intensive study by the school psychologist.
What factors argue for a high cutting score? What factors argue for a low
cutting score?

16. A large office has about ten vacancies a month for clerk-typists. It places
accepted applicants on a waiting list, and when a vacancy occurs offers the job to
applicants in order of their application. Is the personnel department free to set
a cutting score which insures that 95 percent of the girls hired will be
successful?

17. Which is to be preferred, false positives or misses, in each of these situations?

a. Patients entering a hospital are given a reasoning test which gives a rough
indication of organic brain damage. Positives are given a thorough
neurological examination.

b. Candidates for admission to teacher training are screened for ability.

c. It is important to hire skilled sheet-metal workers to fill vacancies, during a
time of tight labor supply. Men cannot be trained on the job.

d. In inducting soldiers, a mental test is used to determine which men are too
dull to be useful to the service.

e. A company wishes to hire mechanics and put them through an expensive
training program; success cannot be observed until the end of the course.

18. The following is taken from a letter to the New York Times:

“I submit that ‘slaughter on the highway’ will continue until state licensing
authorities recognize some simple facts. To drive a car on today’s highways
demands a rather complex set of sensori-motor skills. These skills are ‘normally’
distributed, i.e., some folks have them, some do not. Instruments are available
to measure these skills. Authorities have some remote responsibility here to see
that such instruments are used before licensing.”

a. What degree of validity should be required before tests are used as
proposed?

b. If scores are normally distributed, how should the cutoff score be fixed?

Combining Data from Several Tests 

When several tests are tried out, the results must be evaluated in order to
decide which test is best, and to determine the most useful combination of
tests for predicting. If only one test is to be used in selection, we will
ordinarily select that which has the highest validity. An exception to this rule occurs
when the best test is quite expensive to apply and some simpler test yields
nearly as good a correlation. A second exception occurs when the tests tried
out have quite unequal reliabilities. If a reliable test has the best validity,
and the runner-up is notably unreliable, the best procedure may be to
lengthen the latter test to increase its validity (cf. p. 130). Prediction is
ordinarily improved by combining several tests which cover different relevant
aptitudes.
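
The gain to be expected from lengthening an unreliable test can be estimated with the Spearman-Brown formula and its companion for validity. The sketch below assumes the added items are comparable to the original ones; the validity of .30 and reliability of .50 are invented for illustration and do not come from the text.

```python
from math import sqrt


def lengthened_reliability(reliability, k):
    """Spearman-Brown: reliability of a test made k times as long."""
    return k * reliability / (1 + (k - 1) * reliability)


def lengthened_validity(validity, reliability, k):
    """Validity of a test made k times as long with comparable items."""
    return validity * sqrt(k) / sqrt(1 + (k - 1) * reliability)


# Hypothetical runner-up test: validity .30, reliability .50, doubled in length.
new_reliability = lengthened_reliability(0.50, 2)
new_validity = lengthened_validity(0.30, 0.50, 2)

# The same test made very long approaches a ceiling of validity / sqrt(reliability).
ceiling = 0.30 / sqrt(0.50)
```

Doubling this hypothetical test raises its reliability from .50 to about .67 and its validity from .30 to about .35. No amount of lengthening would carry the validity past about .42, so the comparison with the more reliable test depends on how much room the runner-up has to improve.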

Multiple Correlation and Statistical Weighting. It is customary to employ
multiple-correlation techniques to select the most effective combination of
tests, to determine how they should be weighted in arriving at a final
prediction, and to assess the effectiveness of the composite predictor. Formulas
for computing multiple correlation are given in most statistics texts.

To obtain a high multiple correlation, tests are sought which have a positive
correlation with the criterion and low correlations with each other. There is
little value in combining tests of the same ability; this is equivalent to
making the original test more reliable, and usually raises the validity only
slightly. But if a new test measures a component of the job not estimated by
the first test, it will improve the multiple correlation appreciably. The
example in Table 47 shows how prediction improves when we combine several
tests with low validity. It also shows that the multiple correlation reaches a
ceiling very rapidly, so that adding tests beyond the first three or four
rarely is valuable. More elaborate prediction batteries are worth while only
when each added test measures a new factor.


TABLE 47. Effect on Multiple Correlation of Adding Tests to Battery

                              Correlation with      Multiple Correlation of
                              Criterion (Shop       Criterion with First Test,
                              Performance of        First Two Tests, First
                              Junior-High-School    Three Tests, Etc.
Tests                         Boys)

Paper Form Board              .43                   .43
Stenquist Assembly            .26                   .44
Steadiness                    .29                   .53
Card sorting                  .27                   .563
Tapping                       .18                   .580
Spatial Relations Formboard   .36                   .594
Packing blocks                .28                   .5953

Source: Paterson et al., 1930, p. 83.
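
The dependence of the multiple correlation on the overlap between tests can be
sketched for the two-test case. The formula used below is the standard
two-predictor expression found in statistics texts; the function name and the
validities and intercorrelations fed to it are hypothetical illustrations, not
figures from Table 47.

```python
import math


def multiple_r_two_tests(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two tests.

    r1, r2 : validity of each test (its correlation with the criterion)
    r12    : correlation between the two tests
    """
    if abs(r12) >= 1.0:
        raise ValueError("intercorrelation must lie strictly between -1 and 1")
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    return math.sqrt(r_squared)


# Hypothetical validities (.43 and .29) with three assumed degrees of overlap.
for r12 in (0.0, 0.3, 0.6):
    r_multiple = multiple_r_two_tests(0.43, 0.29, r12)
    print(f"intercorrelation {r12:.1f}: multiple R = {r_multiple:.3f}")
```

With these assumed figures the second test adds a good deal when it is
uncorrelated with the first, and almost nothing once the two tests correlate
about .60.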

One can afford to discard tests from a trial selection battery even though
they have positive correlations with the criterion. There is little value in
extending a battery by adding even reliable tests, if they duplicate abilities
already measured. The following correlations between tests and elimination
from flight training were found by the Air Force (DuBois, 1947, p. 194):

Pilot stanine (i.e., composite score on selection battery)        .653
Stanine plus Qualifying Examination                               .655
Stanine plus Qualifying plus General Classification Test          .655

The multiple-correlation procedure starts with the test validities and test
intercorrelations. Customarily, one selects the test having the highest validity
as the first member of the composite predictor. Then the intercorrelations
are examined systematically to determine which test predicts the criterion
and at the same time least duplicates the test already chosen. The third test,
in turn, must be one which overlaps little with the first two. Out of the same
computations come a set of weights, which place heavy emphasis on the tests
first selected for the battery and smaller emphasis on the tests added later. A
cutting score for the weighted composite is established in the same manner
as for a single test.
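
The stepwise procedure just described can be sketched in a few lines. This is
only an illustration under stated assumptions: the validities and
intercorrelations are invented, the function names are hypothetical, and the
weights are ordinary standardized regression weights obtained by solving the
normal equations, which is one common way of carrying out the computation.

```python
import math
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def beta_weights(validities, intercorr, chosen) -> List[float]:
    """Standardized weights for the chosen tests."""
    r = [validities[i] for i in chosen]
    sub = [[intercorr[i][j] for j in chosen] for i in chosen]
    return solve(sub, r)


def multiple_r(validities, intercorr, chosen) -> float:
    """Multiple correlation of the criterion with the chosen tests."""
    betas = beta_weights(validities, intercorr, chosen)
    return math.sqrt(sum(b * validities[i] for b, i in zip(betas, chosen)))


def forward_select(validities, intercorr, n_tests) -> List[int]:
    """Add, one at a time, the test that raises the multiple correlation most."""
    chosen: List[int] = []
    remaining = list(range(len(validities)))
    while remaining and len(chosen) < n_tests:
        best = max(remaining,
                   key=lambda i: multiple_r(validities, intercorr, chosen + [i]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Invented data: tests 0 and 1 overlap heavily; test 2 is nearly independent.
validities = [0.50, 0.45, 0.30]
intercorr = [[1.0, 0.8, 0.1],
             [0.8, 1.0, 0.1],
             [0.1, 0.1, 1.0]]

battery = forward_select(validities, intercorr, 2)
print("tests chosen:", battery)
print("weights:", [round(b, 3) for b in beta_weights(validities, intercorr, battery)])
print("multiple R:", round(multiple_r(validities, intercorr, battery), 3))
```

In this invented case the second test chosen is the one with the lowest
validity, because it duplicates the first test least.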

Such a set of weights is illustrated in Table 48, which shows how tests were
combined by the Air Force to predict graduation from pilot, bombardier,
and navigator training. The same tests were used for all entering cadets, but
a different combining formula was required for each job. In selecting
bombardiers, for example, discrimination reaction time and finger dexterity
counted heavily, whereas reading and arithmetic had very little weight. The
navigator score, on the other hand, depended primarily on these intellectual
abilities.


TABLE 48. Validity Data and Combining Weights Used in an Air Force
Classification Program

                                  Correlation with       Relative Weight
                                  Criterion
Test                              Bomb.  Nav.   Pilot    Bomb.  Nav.   Pilot

Printed tests
  Reading Comprehension           .12    .32    .19      8      2      --
  Spatial Orientation II          .09    .33    .25      --     10     5
  Spatial Orientation I           .12    .38    .20      --     9      6
  Dial and Table Reading          .19    .53    .19      14     18     4
  Biographical Data, pilot        --     --     .32      --     --     15
  Biographical Data, navigator    --     .23    -.03     --     9      --
  Mechanical Principles           .08    .13    .32      --     --     8
  Technical Vocabulary, pilot     .04    .10    .30      --     --     13
  Technical Vocabulary, nav.      .04    .22    .09      --     --     --
  Mathematics                     .10    .50    .08      --     18     --
  Arithmetic Reasoning            .12    .45    .09      8      12     --
  Instrument Comprehension I      --     --     .15
  Instrument Comprehension II     --     --     .35
  Numerical Operations, front     .13    .26    .01      --     --     --
  Numerical Operations, back      .11    .28    .02      --     --     --
  Speed of Identification         .09    .19    .18      --     --     --

Apparatus tests
  Rotary Pursuit                  .14    .10    .21      12     --     4
  Complex Coordination            .18    .24    .38      12     --     17
  Finger Dexterity                .16    .20    .11      19     6      --
  Discrimination Reaction Time    .22    .36    .22      27     6      4
  Two-Hand Coordination           .12    .26    .30      --     11     4
  Rudder Control                  --     --     .42      --     --     12

Note: The criterion for the various validity coefficients is graduation or
nongraduation from training.
Source: DuBois, 1947, pp. 99, 101.

A combination of tests makes the greatest contribution when the original
battery contains independent tests measuring quite different factors. If the
tests overlap to a large degree, the procedure will eliminate most of the tests
and it is almost certain that some aspects of the criterion will not be
measured. One of the major claims of factor analysis is that it will ultimately
permit the preparation of “pure” tests, each measuring one factor and very
little else. These various factors can then be put together in whatever
proportion a given criterion demands, whereas an impure test puts both wanted
and unwanted factors into the composite. This argument is logical enough, and
the aim is increasingly being attained in such batteries as the GATB. Pure
tests have not been easy to devise, however.

19. The Holzinger-Crowder manual offers weights for estimating relative scores on
certain general mental tests. Interpret this equation:

Estimated CTMM standing = 7.5V + 3S + 9N + 1.3R

20. How do you account for the different weights assigned the first two tests in
Table 48 for navigator prediction, in view of their similar validity coefficients?

21. Which of the three aircrew jobs has the smallest psychomotor component,
according to the prediction weights?

Multiple Critical Scores. Instead of pooling all scores into a composite,
some personnel workers select their tests either by multiple correlation or by
other methods and then use a strategy known as the multiple cutoff or multiple
critical score. For each test a separate critical score is determined. The
critical scores are adjusted so that persons who pass all the hurdles have a
satisfactory probability of acceptable performance.

The most extensive use of this plan is in the application of the GATB
battery. Occupational standards are established by considering the average
score of successful workers in the job, and also the correlation of the tests
with the criterion and with each other. An example is the standard for the
job of “mounter.” A mounter assembles radio-tube mounts and connects very
small parts and wires, welding them in place. The passing standard for this
occupation on the older form of the GATB is Form Perception (P) 85, Aiming
85, Finger Dexterity 90, Manual Dexterity 85. The cutting score eliminates
the lower third of the workers, thereby weeding out the most probable
failures.

The validity data in Table 49, used to establish the standards, were based
on 65 cases. The validity coefficients suggest selecting on scores F, M, A, and
T, but the high correlation of A with T indicates that only one of the two
need be used. (In the more recent form, which does not measure Aiming,
a cutoff of 85 on K [motor speed] has been substituted.) The USES noted
the additional fact that these workers were distinctly above average in P. The
implication is that perceptual ability may be a selective influence leading
some people to enter or remain in the job of mounter even though it does not
correlate with performance within the selected group. For this reason, P was
added as a hurdle.

TABLE 49. Data Used in Establishing GATB Occupational Standards
for Mounters

                                               Correlation with
Aptitude Score              M        s.d.      Production Records

G  General                  106.9    15.3      -.075
V  Verbal                   102.2    14.7      -.061
N  Numerical                105.8    13.3       .064
S  Spatial                  109.3    16.6      -.009
P  Form Perception          111.8    15.6       .015
Q  Clerical Perception      106.2    15.9       .097
A  Aiming                   107.1    13.9       .229
T  Motor Speed              103.6    15.5       .191
F  Finger Dexterity         109.5    18.4       .437
M  Manual Dexterity          98.7    20.7       .353

Source: Guide to the Use of GATB, 1958, p 1-1 15

The validity of the composite pattern (F, P, A, M) was tested by a
tetrachoric correlation calculated from a table of hits, misses, and false
positives. The correlations for three samples, using production records as a
criterion, were .46, .49, and .52. The composite is thus a trifle better than
prediction from F alone would be.

22. In a fourth sample of mounters, supervisors’ ratings were used as a criterion.
How do you account for the fact that the correlation was only .24?

Comparison of Composite and Cutoff Procedures. The weighted composite
ranks individuals according to their expected criterion scores, and a single
cutoff is established. All persons whose composite is above that level are
accepted. A graphic illustration is given in the left panel of Figure 62; all
persons above the line are accepted. The multiple cutoff eliminates persons who
are low on either test, with the result diagramed in the right-hand panel of
Figure 62. For most individuals both procedures lead to the same decision.
Persons in areas 1 and 2, in the lower-right and upper-left corners of the
scatter diagram, are low on one test and high on the other. They are rejected
by the critical-score procedure but are accepted when a weighted composite
is used. The persons in area 3 are just above the minimum on both tests; they
are accepted by a multiple-cutoff plan but are rejected when the composite
score is used.

FIG. 62. Two selection plans. Left panel: Weighted Composite. Right panel:
Multiple Cutoff.
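
The two decision rules can be put side by side in a short sketch. Everything
numerical here is an assumption made for illustration: two tests on a scale
with mean 100, equal weights, a composite cutoff of 100, and a minimum of 90 on
each test. The function names are likewise hypothetical.

```python
def accept_by_composite(x: float, y: float, w_x: float, w_y: float,
                        cutoff: float) -> bool:
    """Weighted composite: strength on one test may offset weakness on the other."""
    return w_x * x + w_y * y >= cutoff


def accept_by_multiple_cutoff(x: float, y: float,
                              min_x: float, min_y: float) -> bool:
    """Multiple cutoff: every hurdle must be passed separately."""
    return x >= min_x and y >= min_y


# Hypothetical applicants, chosen to fall in the areas where the plans disagree.
applicants = {
    "high on X, low on Y": (130, 80),
    "low on X, high on Y": (80, 130),
    "just above both minimums": (92, 92),
    "high on both tests": (120, 115),
}

decisions = {}
for label, (x, y) in applicants.items():
    composite = accept_by_composite(x, y, 0.5, 0.5, 100)
    hurdles = accept_by_multiple_cutoff(x, y, 90, 90)
    decisions[label] = (composite, hurdles)
    print(f"{label:26s} composite: {composite!s:5s}  multiple cutoff: {hurdles}")
```

Under these assumed cutoffs the first two applicants are accepted only by the
composite, the third only by the multiple cutoff, and the fourth by both.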

The advantages of the weighted composite are as follows:

• It gives additional information, indicating how each accepted man
ranks within the group. This is useful for identifying men requiring special
assistance during training, or for singling out superior men for special
responsibility.

• By estimating the man’s probable success in the assignment it permits
comparison with his probable success in other lines instead of merely ruling
out the assignments where he would fail.

• If there is a linear relation between tests and the criterion (a matter to
be discussed further below), it gives a higher proportion of correct decisions
than the multiple cutoff.

The multiple cutoff has these advantages:

• It is easier to compute, administer, and explain to laymen than is the
composite score.

• Retaining the scores of separate tests provides a valuable basis for
counseling.

• When the relation between tests and criterion is configural rather than
linear, the multiple cutoff can yield a higher proportion of correct decisions
than the customary weighted composite.

The essential difference is that the weighted composite acts on the
assumption of compensation among abilities. A person weak in dexterity may
be accepted if he has exceptional perceptual ability; strength in the one is
presumed to make up for weakness in the other. In most predictions this
assumption is justified, but not in all.

At one time during World War II the Navy desired to train men to operate
antisubmarine listening gear. By the usual correlation procedure,
psychologists established a prediction formula: men were screened on general
intelligence, and within the surviving group, predictions were based on the
average of the Bennett TMC and several Seashore tonal tests. Following
standard Navy procedure, acceptable men were sent to training school, and
those who failed in school were assigned to general sea duty. It was therefore
a serious matter when a man of good intelligence was sent to a school
for which he was unqualified, since his ability would not be properly used.
Many men who failed in sonar school did so because of very poor tonal
judgment, which made them unfitted for listening duty. How had they happened
to be sent to sound training? Their high mechanical comprehension
(many had studied college physics) raised their composite enough to conceal
their weakness. Such men, despite an adequate “average” ability, were
doomed to fail in sound training whereas they would have been excellent in
engineering, radar maintenance, or navigation. In fact, a few were salvaged
by school officers and sent to other training, where they did well. Ultimately,
a multiple-cutoff procedure was adopted.

Configural Prediction. The estimation equation derived by the usual
multiple-correlation method is a linear, additive equation. That is to say,
estimated scores are obtained by adding predictors. The criterion score is
assumed to increase regularly as a function of the predictor score. This linear
assumption is not always justified. A certain amount of manual dexterity may
be required on an assembly line, but increased dexterity above the minimum
makes little or no difference in the value of an employee if the assembly
line moves at a fixed speed. A worker who keeps ahead of the line produces
no more than one who can just comfortably keep up. His greater dexterity
does not compensate for weakness in some other ability.

Many writers have condemned conventional additive prediction methods,
claiming that “patterns” or “configurations” of scores will yield better
prediction. We cannot consider this issue fully here, and confine ourselves to
a few summary statements:

• The writers who propose to interpret “patterns” are often vague in
their proposals, and writers mean different things by that term. Some of the
writers advocate examination of simple differences between abilities which
the multiple-correlation procedure automatically takes into account. Others
are arguing for nothing more complicated than a multiple-cutoff procedure.

• The multiple-correlation method can be extended to take into account
configural relationships of any degree of complexity. It need not be limited
by the linear assumption.

• In practical prediction problems, a linear assumption has almost always
proved adequate to account for the data, and nothing is gained by introducing
configural formulas unless data are uncommonly reliable.

• In a limited number of investigations, configural treatment of scores
has permitted much better prediction than a linear composite. Most of these
investigations have involved personality variables as predictors.

One pioneer, highly tentative study suggests what may be done with
configural prediction. Frederiksen and Melville (1954) thought it possible that
for compulsive students, who work hard on tasks even when they are not
interested, achievement in engineering would be predicted by ability tests and
not by interest tests. Among noncompulsive students, however, who work
hard only on what interests them, they thought an interest test might be an
important additional predictor. The investigators used two indications of
compulsiveness or unusual concern with accuracy: having interests like those
of professional accountants, and having low speed on a reading test relative
to the vocabulary score. Some of the validity coefficients of predictors against
average grades were as follows:



                                   Compulsives    Noncompulsives

High-school grades                  .47             .50
Mathematics test                    .31             .61
Interest in engineering            -.18             .36
Interest in selling real estate    -.04            -.55


The evidence thus supports the hunch that for noncompulsive students
interest in the subject matter is an important factor in success, but that for
compulsive students interests make no difference. If this relation were
verified in further work, the correct selection procedure would be to divide
men into two groups. For the compulsive group, a composite ability score would
be used; for the noncompulsives, a different combining formula considering
interests would be needed.

Sequential Strategies 

In the multiple-cutoff or composite-score plan, all the tests are given at
once and the decision is then made. For many personnel decisions a more
economical procedure is the sequential plan, in which testing is divided into
several stages. After each stage, a decision is made to reject some men, to
accept others, and to continue testing those close to the borderline.

The decisions in the first stage of a sequential plan are based on a short
and incomplete test, and the method therefore makes somewhat more wrong
decisions than a plan in which every man takes all tests. The plan reduces
costs, however, because relatively few men take the second and later sets of
tests. Especially when the later stages include expensive apparatus tests or
interviews, the saving can be considerable. Sequential methods are generally
superior to single-stage testing, since great savings in testing cost can be
accomplished with a very small reduction in the correctness of final decisions
(Arbous and Sichel, 1952; Cronbach and Gleser, 1957, pp. 48-63, 82-87).
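
A two-stage version of the sequential plan might be sketched as follows. The
scores, the three cutoffs, and the function name are all hypothetical; the
point is only that the longer battery is given solely to men whose short-test
score falls in the borderline band.

```python
from typing import Callable, Tuple


def two_stage_decision(short_score: float,
                       reject_below: float,
                       accept_from: float,
                       give_full_battery: Callable[[], float],
                       final_cutoff: float) -> Tuple[str, bool]:
    """Return (decision, whether the second stage was needed)."""
    if short_score < reject_below:
        return "reject", False
    if short_score >= accept_from:
        return "accept", False
    # Borderline on the short test: only now is the costly battery given.
    full_score = give_full_battery()
    return ("accept" if full_score >= final_cutoff else "reject"), True


# Hypothetical applicants: (short-test score, score on the full battery).
applicants = [(35, 48), (55, 58), (62, 71), (66, 64), (78, 80), (90, 85)]

second_stage_count = 0
for short, full in applicants:
    decision, used_second_stage = two_stage_decision(
        short, 50, 70, lambda full=full: full, 60)
    second_stage_count += used_second_stage
    note = " (after second stage)" if used_second_stage else ""
    print(f"short test {short}: {decision}{note}")

print(f"{second_stage_count} of {len(applicants)} applicants took the full battery")
```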

23. What practical considerations determine whether a sequential plan should be
used in hiring workers for training?

Nonstatistical Combination of Scores 

All the procedures discussed so far involve an experimental determination
of the best possible rule for selection, and rigorous application of that rule.
The assumption underlying these approaches is that a statistical formula will
more often point to the correct decision than will a procedure which depends
on the judgment of a psychologist. The psychologist with a clinical
orientation often complains that such a mechanical method is insensitive to the
unique characteristics of the particular case and cannot possibly be as wise
as a psychologist.

It seems that a competent professional interpreter, bringing reason to bear
on the data, should do no harm and might well improve decisions, but this
expectation is contradicted by experience. In the Navy, for example, trained
classification specialists interviewed each man, having at hand his test scores,
a life history, and other data. The interviewers gave a final rating as to the
man’s probable success in the training to which he was assigned. A mechanical
prediction formula combining two tests (Electrical Knowledge and
Arithmetic Reasoning) correlated .50 with success in training of electrician’s
mates. The interviewer's rating, based on these tests plus judgment,
correlated only .41 with success. In other words, judgments departing from the
statistical formula reduced the correctness of prediction (Conrad and Satter,
1945). P. E. Meehl (1955) made a major comparison of “clinical vs.
statistical prediction” in which he examined every study where predictions
made by judges could be compared with predictions made from the same
data by statistical formula. In some twenty studies where such a comparison
could be made, he found that the actuarial, cookbook prediction was equal
or superior to the judgmental prediction in every case save one. The
statistical method, which is obviously cheaper to apply, beats the judge time
after time, whether the judge be a counselor, a clinical psychologist, or an
industrial personnel manager.

Why does the judge do so poorly? The foremost reason is that he combines
the data by means of an intuitive weighting which he has not checked. The
statistical formula, on the other hand, has been carefully checked on a sample
of cases like the one for whom the new prediction is made. It uses the
best possible set of weights. The judge can beat the formula only by bringing
in additional data and combining those data in the proper manner with
the facts used by the formula.

It is very difficult for a judge to function efficiently. In the first place, he
does not know what weights he uses to arrive at a decision. He looks at the
man from various angles and finally comes to an intuitive decision. Almost
certainly, he gives greater weight to some factors than they deserve, and
changes his weights from one case to the next. Moreover, his judgment is
unreliable, in the sense that he might judge the same case differently on
different days. The formula never varies. There is some reason to think that
judges give too much weight to the additional facts they add to the test data.

Judges make many constant errors. They have stereotypes and prejudices;
for example, they make different predictions for women from what they
would for men with similar scores even when there is no evidence that men
and women perform differently on the job. In one of the most interesting
studies, counselors judged badly because they applied a completely sound
principle in a situation where for some strange reason the principle was
inappropriate. Sarbin (1943; see also Cronbach, 1955b) asked counselors to
predict grade averages of students at Minnesota from their high-school
records, ACE test scores, and a whole dossier of information on interests,
experiences, and motivations. The statistical formula combining ACE and high
school rank had a validity of .45 for men, .70 for women. The counselors did
a little worse: .35 for men, .69 for women. When we examine the weights
used, we find that the formula placed almost its entire weight on high-school
rank and paid no attention to ACE scores because in the three preceding
classes at Minnesota the ACE had made no independent contribution to
prediction. The counselors, however, gave about equal weight to high-school
marks and to ACE scores. Such a weighting had been found best in most of
the reported investigations of college success, and is quite consistent with
experience of colleges generally. For some reason, the Minnesota situation
during the period of this study was unique; quite reasonable weights used by
the counselors were the wrong ones for this situation. The statistical formula
was custom-made for the Minnesota situation and of course it did better than
the counselors working from general psychological lore.

What does this imply? It implies that counselors, personnel managers, and
clinical psychologists should use formal statistical procedures wherever
possible to find the best combining formula and the true expectancies for their
own situation. They should then be extremely cautious in departing from the
recommendations arrived at on the basis of the statistics, unless they are
strongly convinced that the additional information they bring in is a valid
basis for decision. When they do use their own judgment, they should make
careful follow-up studies, comparing their number of hits with the number
of hits the formula would have yielded. Moreover, those who make judgments
must try to formulate the bases they use, stating just what they take
into account and how heavily they weight each bit of information.

The proper province of the statistical formula is the institutional decision,
where a definite and irreversible decision is required to carry out the
purposes of the institution. The admissions officer having to choose the most
promising applicants for a limited number of openings should certainly make
decisions in whatever way will be most accurate. In counseling, on the other
hand, decisions are personal decisions of the client and cannot be dictated
by the experience table. The counselor's responsibility is to help the client
understand himself; if, having the facts before him, he wishes to embark on
a course in which failure is likely, it is his right to use his life that way.


INTERPRETING SELECTION STUDIES 
What Is an Acceptable Validity Coefficient? 

As validity coefficients for various tests have been presented in past
chapters, the reader probably has been classifying them mentally as “good” or
“poor.” Many tests, particularly those of special abilities, do not seem very
satisfactory at first glance. But, in one sense, a test has a satisfying validity
coefficient if it is better than other tests for the same purpose. The only sound
standard for judging a validity coefficient is the question: Does the test
permit us to make a better judgment than we could make without it, and
sufficiently better to justify its cost? The older literature on testing placed
little value on tests with moderate validity coefficients. A so-called
“coefficient of forecasting efficiency” was computed which purported to tell
how much better a prediction from the test was than a random guess. According
to this coefficient, validity had to reach .86 before a test was “50 percent
better than chance.” Tests with validity below .50 were thought to have
negligible practical value. This line of reasoning, we now know, is based on
inappropriate assumptions (Cronbach and Gleser, 1957). The reader is advised to
disregard the coefficient of forecasting efficiency and its implications, if he
encounters them in his reading.

Psychologists have abandoned their insistence on validity coefficients of
.70 or .80 for all tests. While we would be pleased to reach these or better
levels, the experience of thirty years of practical testing shows that we
cannot often attain such standards. Coefficients as low as .30 are of definite
practical value (cf. Table 48). Occasionally, a test with much lower validity is
promising for further development, if it measures what no other test does. In
discussing this point, Strong comments that the test critic who is
contemptuous of low positive correlations is quite willing to accept information
of no greater dependability “when he plays golf or employs a physician.” The
correlation of golf scores between the first and second eighteen holes in
championship play is, he says, about .30, and the reliability of medical
diagnosis near .40 (Strong, 1943, p. 55).

In his discussions with executives, the personnel psychologist would like to
state just how much benefit a selection program offers to a business, a school,
or a military force. He can give a partial answer by comparing selected and
unselected men with respect to number of failures in training, average length
of training required, rate of turnover, average production, and so on. All
these sources of evidence have been used, and all of them show that tests
with validities in the range from .30 to .50 make a considerable contribution
to the efficiency of the institution even though they make faulty judgments
in many individual cases.

The best single rule of thumb for interpreting validity coefficients is the
one developed by Brogden (1949). Making certain reasonable assumptions,
he showed that the benefit from a selection program increases in proportion
to the validity coefficient. Suppose the 40 applicants out of 100 who score
highest on a test are hired. We can consider the average production of
randomly selected men as a baseline. An ideal test would pick the forty men
who later earn the highest criterion score; the average production of these
men is the maximum that any selection plan could yield. A test with validity
.50, then, will yield an average production halfway between the base level
and the ideal. To be concrete, suppose the average, randomly selected
worker assembles 400 gadgets per day, and the perfectly selected group of
workers turns out 600. Then a test with validity .50 will choose a group
whose average production is 500 gadgets, and a test of validity .20 will select
workers with an average production of 440 gadgets. The assumptions
underlying Brogden’s rule are these:

• The job to be performed remains the same, whether men of high or low
ability are selected.

• Production (or other measure of benefit) has a linear relation to test
score.
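
Brogden's rule of thumb amounts to a simple interpolation between the base
level and the ideal. A minimal sketch, using the gadget figures from the
illustration above (the function name is a hypothetical label):

```python
def expected_production(validity: float, baseline: float, ideal: float) -> float:
    """Average output of the selected group under Brogden's proportionality rule.

    baseline : average output of randomly selected workers
    ideal    : average output of a perfectly selected group
    """
    return baseline + validity * (ideal - baseline)


for validity in (0.20, 0.50, 1.00):
    output = expected_production(validity, baseline=400, ideal=600)
    print(f"validity {validity:.2f}: about {output:.0f} gadgets per day")
```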

The benefit derived from a selection plan depends on the selection ratio,
as well as on the validity of the test. The selection ratio is the proportion of
persons tested who are accepted. If there is a large labor supply, the
selection ratio can be very low, but when applicants are scarce the selection
ratio may be forced up toward 1.00. Even an ideal selection plan has no effect
on the quality of workers when every applicant must be hired. If one can pick
and choose, average output can be much improved. Figure 63 shows the
relation of production to selection ratio and test validity for the hypothetical
gadget assemblers used in the illustration above. In this figure we have
assumed that among unselected workers the average production is 400
gadgets, and the standard deviation is 100. Tests of low validity have
considerable value when the selection ratio can be very low, when individual
differences in job performance are large, and when small increases in
production have a large dollar value.



FIG. 63 Benefit from a selection program as a function of validity and 
selection ratio, under Brogden's assumptions 
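
The joint effect of validity and selection ratio can be sketched numerically.
The sketch adds an assumption the text does not state, namely that test scores
are normally distributed; under that assumption the mean standard score of the
top-scoring fraction is the normal ordinate at the cutoff divided by the
selection ratio. The mean of 400 and standard deviation of 100 are the figures
assumed for Figure 63; the function name is hypothetical, and the values
computed this way need not match the round figure of 600 used in the earlier
illustration.

```python
from statistics import NormalDist


def mean_output_of_selected(validity: float, selection_ratio: float,
                            mean: float = 400.0, sd: float = 100.0) -> float:
    """Expected average output of those hired when the top fraction is taken."""
    if not 0.0 < selection_ratio <= 1.0:
        raise ValueError("selection ratio must be greater than 0 and at most 1")
    if selection_ratio == 1.0:
        return mean  # everyone is hired, so the test cannot help
    std_normal = NormalDist()
    z_cut = std_normal.inv_cdf(1.0 - selection_ratio)
    mean_test_z = std_normal.pdf(z_cut) / selection_ratio
    return mean + validity * sd * mean_test_z


for ratio in (0.9, 0.5, 0.1):
    row = ", ".join(
        f"r={v:.2f}: {mean_output_of_selected(v, ratio):.0f}"
        for v in (0.20, 0.50, 1.00))
    print(f"selection ratio {ratio:.1f} -> {row}")
```

Under these assumptions even a test of validity .20 yields a visible gain when
only one applicant in ten need be hired.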


In evaluating a validity coefficient to decide whether a test is worth using
for selection, one must ask the following questions.²

² The questions are so worded that an answer of “no” indicates that tests of
relatively low validity are likely to be helpful.





Are individual differences in job performance or other outcomes fairly
small?

Can we afford to discharge or transfer to other duties men who prove to
be unsuccessful? I.e., can we tolerate “misses”?

Is it important to hire every applicant who will be satisfactory, even
though this also involves hiring many men who will fail? I.e., must we avoid
“false positives”?

Does this test measure an ability which is already fairly well measured by
other tests or procedures already in use?

Is it possible to modify the job so that it makes less demand on the
aptitude tested?

Is the validity coefficient much lower than the reliability coefficient of the
test? (If not, lengthening the test should raise validity.)

Is administration of the test difficult and costly?

24. In the light of the foregoing questions, how satisfactory is Deemer’s validity
coefficient of about .65 for selecting shorthand students?

25. In one pilot-selection study, the predictive validity of pencil-and-paper tests
was .64 (elimination-graduation criterion). The coefficient was raised to .69
when apparatus tests were added. Is such a small increase worth while, in view
of the questions listed above?

26. State employment offices use tests to guide workers into appropriate positions.
A very low selection ratio may be used, since a particular unemployed worker
may be directed into any one of hundreds of job families. In a particular
insurance agency, on the other hand, it is necessary to employ about 60 percent
of those who apply for clerical jobs. Are the same tests equally suitable in both
situations?

27. In which of these situations is there likely to be a fixed number of vacancies,
and in which can the decision maker set the critical score as high or low as he
likes?

a. A parole board decides which prisoners may be released.

b. An engineering school admits well-qualified applicants.

c. A school psychologist identifies mentally handicapped children to be
placed under a special teacher.

d. A college counseling bureau identifies clients likely to profit from
psychotherapy.

Restriction of Range. Tests predict less accurately when they are applied 
to a homogeneous group. Validity coefficients rise when a test is applied to a 
group with a wide range of ability, and drop when the test is used on a 
restricted, preselected group. Many studies are based on selected groups. 
Deemer, for example, did not test how well his instrument predicted 
shorthand learning of all girls. Instead, it was tried on girls already planning to 
take the course. Many girls of low aptitude were not included, since normally 
those entering a shorthand course have successfully completed some work 
in typing. If Deemer’s test were applied to an entirely unscreened group, a 
higher coefficient would result. 

The effect of screening upon validity coefficients is illustrated by the Air 
Force study referred to earlier. The validity coefficient of the battery for 
pilot selection was in the neighborhood of .37 for men who met standards for 
flight training. When, for experimental purposes, a completely unscreened 
group was sent into pilot training, the failure rate rose enormously. In this 
unrestricted group, the validity coefficient rose to .66 (DuBois, 1947, pp. 103, 
193). 
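
The size of such a change can be approximated with the standard correction for 
direct restriction of range on the predictor. The sketch below is only an 
illustration and is not the computation reported by DuBois: the function names 
are invented here, and the ratio of standard deviations (2.2) is a hypothetical 
value chosen to show that a coefficient of .37 in a screened group is compatible 
with a coefficient near .66 in an unscreened group. 

```python
import math


def unrestricted_validity(r_restricted, sd_ratio):
    """Estimate the validity coefficient in the unscreened group.

    sd_ratio = predictor SD in the unscreened group divided by the
    predictor SD in the screened group. The formula assumes selection
    was made directly on the predictor, with a linear regression and
    equal scatter at all score levels (assumptions of this sketch).
    """
    rk = r_restricted * sd_ratio
    return rk / math.sqrt(1.0 - r_restricted ** 2 + rk ** 2)


def restricted_validity(r_unrestricted, sd_ratio):
    """Inverse case: the coefficient expected after screening narrows the range."""
    return unrestricted_validity(r_unrestricted, 1.0 / sd_ratio)


# Hypothetical SD ratio of 2.2; this figure is not given in the source.
estimate = unrestricted_validity(0.37, 2.2)
print(round(estimate, 2))
```

With these assumed figures the estimate is close to the coefficient found in 
the unscreened group, which suggests that screening alone could account for 
most of the difference. 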

Investigators are frequently perplexed when a variable listed in the job 
analysis fails to predict the criterion of success. The job analysis may have 
been correct in listing the ability as essential to the job, yet selection may 
have reduced its significance as a predictor. If future applicants will be 
drawn from a similarly selected group, this variable will not help in 
prediction. But if the tests are applied to an unselected group, the variable which 
had no predictive value in the restricted group may turn out to be a good 
predictor. For example, intelligence tests have consistently been poor 
predictors of success in teaching. The explanation is obvious. Nearly every 
teacher has survived years of schooling with at least adequate grades, 
which assures a fair to superior degree of intelligence (Figure 64). Among 
those so selected, differences in tested intelligence play little part in 
determining success as teachers. Granted that an intelligence test will not help 
a school system hire teachers, an intelligence test is still a major factor in 
advising a girl in high school whether she is likely to be able to complete a 
teacher-training course. Failure to recognize the effects of restricting range 
sometimes leads to discarding useful tests. In 1930, Moss developed a test for 
selecting medical students which in tryout studies had good correlations 
with grades. When schools selected students on the basis of scores on the 
Moss test, they began to discover that the predictive coefficients were quite 
low. Ultimately, in 1946, the Moss committee was discharged and the test 
was abandoned. Then, when the test was no longer used to select students, 
so that scores again covered the full range, research studies began to report 
higher coefficients again. 

28. If one were considering the probable success in industrial jobs of graduates 
from an engineering school, what characteristics would have a restricted range 
owing to preselection? What characteristics would probably not have been 
restricted? 

FIG. 64. Hypothetical data illustrating the effect of preselection upon 
correlation. Dots show scores to be expected if every ninth grader interested 
in teaching later enters the profession. Circles show scores of persons likely 
to survive gradual elimination as a consequence of low school marks. 

Contamination of Criteria. It is important to guard against contamination 
of criteria, which spuriously raises correlations. Wherever ratings are used as 
criteria, there is a possibility that teachers, foremen, or other judges are 
influenced by knowledge of the prediction data. Teachers may be influenced 
in their grading by knowledge of a pupil’s IQ. A foreman may rate a man 
higher than his performance warrants because he knows the man has 
considerable experience. These influences raise the correlation between grade 
and intelligence, or rating and experience. The only way to eliminate 
contamination is to keep predictor data secret until all criterion scores have been 
collected. 

29. In each of the following situations, trace how contamination might occur, and 
suggest an improved procedure to avoid it. 

a. A psychologist administers aptitude tests to entering college freshmen and 
from the results predicts each student's success. Success is determined after 
two years by noting which students have been dropped from school by the 
school guidance committee for unsatisfactory work. The predictions are kept 
in a locked file and not made available until the two years have passed. 
The psychologist is a member of the committee but does not disclose the 
predictions. 

b. Test data on pupils' intelligence, mathematical ability, and other facts are 
made available to science teachers so that they can do better teaching. 
Learning in science is judged not by ratings but by an objective test of 
ability in science given at the end of the course. The pretests are correlated 
with this final score. 

c. Tests for selecting salesmen are being tried experimentally. Because they 
are thought to be valid, the results are given to the sales manager for his 
guidance in assigning territories to the salesmen in the experimental group. 
After a year of trial, each man is judged by the amount of his sales in 
relation to the normal amount for his territory. 

d. Flight instructor's ratings are used as a basis for promoting men from 
primary to advanced training. It is desired to check the validity of these ratings 
as predictors of success in advanced training. Advanced training is taken at 
the same field, with a different instructor. This man's judgment supplies the 
criterion. 

Criterion Unreliability and Bias. The size of a validity coefficient is limited 
by the reliability of the criterion. A low validity coefficient may be the result 
of poor criterion measurement rather than poor prediction. Grades and 
ratings are particularly likely to be unreliable, whereas objective measures of 
achievement can be made very accurate. 
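
The limit that criterion unreliability places on validity can be made concrete 
with the attenuation relations of classical test theory. The following is a 
minimal sketch under that model; the function names and the reliability and 
validity figures are hypothetical and are not taken from any study cited here. 

```python
import math


def validity_ceiling(criterion_reliability, test_reliability=1.0):
    """Largest validity coefficient obtainable under classical test theory."""
    return math.sqrt(criterion_reliability * test_reliability)


def corrected_for_criterion_unreliability(observed_validity, criterion_reliability):
    """Validity the test would show against a perfectly reliable criterion."""
    return observed_validity / math.sqrt(criterion_reliability)


# Hypothetical case: instructor ratings with reliability .49,
# and an observed validity coefficient of .35.
ceiling = validity_ceiling(0.49)
corrected = corrected_for_criterion_unreliability(0.35, 0.49)
print(round(ceiling, 2), round(corrected, 2))
```

In this assumed case even a perfect predictor could not correlate above about 
.70 with the ratings, so the modest observed coefficient says as much about 
the criterion as about the test. 
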
In many studies improvement in validity coefficients is obtained by 
refining the criterion rather than by continued development of the predictors. 
An example of the effects of better criteria is shown in Figure 65. When 
grades for Navy classes in ship’s engine operation were based only on 
instructor’s judgments, school grades had rather small correlations with 
predictor tests. But when two highly valid achievement tests were used in allotting 
school grades, the classification tests were much better predictors. It is to be 
noted that the subjective grades were influenced most by academic and 
intellectual abilities. When a valid measure of job knowledge and skill was 
applied as a criterion, the prediction rested most heavily on Mechanical 
Knowledge and Mechanical Aptitude. The tests that predict a valid criterion 
may be different from those that predict a biased, incomplete criterion. 

FIG. 65. Correlations of Navy classification tests with grades in Basic 
Engineering School, before and after introduction of standard achievement 
tests (Stuit, 1947, p. 307). 


Necessity for Confirmation of Findings 

When an investigator has once obtained a satisfactory validity coefficient, 
he tends to install his program and stop research. Other workers, reading his 
report of the study, may accept his test as valid and put it to work in their 
own situations. This practice is unsound. In the first place, any validation 
result is influenced by chance, and correlations will fluctuate from sample 
to sample. Consequently the test which proves best in one sample may prove 
not to be the best predictor in another similar sample. Even when the 
results are based on a large sample, the particular critical score or the 
particular weights most effective in a multiple correlation are certain to change 
when a new group is tested. If the same formula is applied in other groups, 
the correlation is sure to drop. Moreover, the supply of men and the 
conditions of training change from time to time. It follows that the investigator 
must redetermine the validity of his prediction technique periodically. 

The weights for a composite score, or the critical scores in a 
multiple-cutoff procedure, are determined so as to get the best possible prediction in 
the sample studied. In the next sample, the same formula will have lower 
validity. We speak of this as the “shrinkage” of validity. Shrinkage is likely 
to be great when many possible predictors are tried and when weights are 
determined from small samples. Shrinkage is relatively small when the 
predictors are chosen initially on the basis of substantial past experience 
and theory, and relatively large in a “shotgun” study where miscellaneous 
predictors are tried with no particular rationale. 

To estimate properly the validity of any scoring formula one must 
cross-validate by trying the formula on a sample not used in selecting tests and 
establishing scoring weights. Sometimes the validity remains nearly the same 
in the second sample, but sometimes there is considerable shrinkage. 
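
Shrinkage in a “shotgun” study can be illustrated by simulation. The sketch 
below is a constructed example, not a reanalysis of any study: it generates 
100 predictors that are pure noise (true validity zero), picks the one that 
correlates best with the criterion in a first sample of 40 men, and then 
cross-validates that predictor on 400 fresh cases. The sample sizes, the seed, 
and all names are assumptions made for illustration. 

```python
import random


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def shotgun_study(n_predictors=100, n_first=40, n_second=400, seed=1):
    """Pick the best of many worthless predictors, then cross-validate it."""
    rng = random.Random(seed)
    total = n_first + n_second
    criterion = [rng.gauss(0, 1) for _ in range(total)]
    predictors = [[rng.gauss(0, 1) for _ in range(total)]
                  for _ in range(n_predictors)]
    # Selection uses only the first sample, as in an original validation study.
    best = max(predictors,
               key=lambda p: pearson(p[:n_first], criterion[:n_first]))
    r_first = pearson(best[:n_first], criterion[:n_first])
    # The cross-validation sample played no part in choosing the predictor.
    r_second = pearson(best[n_first:], criterion[n_first:])
    return r_first, r_second


r_first, r_second = shotgun_study()
print(f"validity in the sample used to choose the test: {r_first:.2f}")
print(f"validity of the same test in a fresh sample:    {r_second:.2f}")
```

Because the chosen predictor was selected for its chance success in the first 
sample, its coefficient there is inflated; only the second figure is a fair 
estimate of what the test will do in use. 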

The term cross-validation usually indicates a second study in the same 
factory or school where the prediction formula was developed. The general 
psychological reader wants to know how well the formula holds up in other 
situations. This is the question of validity generalization. We have seen 
several examples of the fact that formulas cannot be transferred automatically 
to new situations. DAT scores had different validities in different schools. 
The value of a formula for selecting sewing-machine operators shifted when 
the job emphasis shifted from speed to quality. Minnesota counselors went 
wrong when they gave the same weight to ACE scores as had been used at 
other colleges. No matter how well a selection procedure is validated and 
cross-validated in the original situation, it must be validated anew when it 
is carried into a new situation. Published results help only by suggesting what 
tests should be included in the tryout battery. 

30. In a configural scoring formula, weights may be assigned to every variable 
and to every pair of variables. How many scores are weighted (or considered 
for possible weighting) when a configural method is applied to a set of ten 
subtests? What does this imply about the shrinkage of configural validities? 

CLASSIFICATION DECISIONS 

The employment manager and the college admissions officer make genuine 
selection decisions. They hire or admit some applicants and have nothing 
further to do with those they reject. Classification decisions are far more 
numerous than selection decisions, and many so-called selection programs 
really lead to classification decisions. A classification decision is one in 
which persons are assigned to different jobs, courses, therapeutic 
treatments, etc. The task in classification is to assign each person to the job where 
he can do best, subject to limitations imposed by the number of vacancies in 
each job category. The decision maker is concerned about the subsequent 
performance of everyone, rather than just the persons assigned to one 
treatment. The Air Force program for “pilot selection” is really a classification 
program, because men who do not pass the tests are retained in the service 
and assigned to other duty. 

The theory of classification testing must probe into the same questions 
as the theory of selection. There are methods of combining scores for 
classification purposes, strategies for assigning persons to fill quotas, and so on. The 
methods differ quite a bit from those appropriate in simple selection. We 
shall not attempt to summarize these methods and the related theoretical 
principles, except to comment on the relation between test validity and 
classification efficiency. 3 

A test which predicts success within many jobs is a poor instrument for 
classification because it does not tell which job the person can do best. The 
ideal classification test is one which has a positive correlation with 
performance in one job and a zero (or better yet, negative) correlation with 
performance in other jobs. A general mental test is of little value for 
deciding which curriculum a college student should enter, even though it correctly 
indicates that he will do well in academic work. 

3 For a summary of much of the theory, and further references, see Cronbach and 
Gleser, 1957, esp. chaps. 6, 9. 

When we apply Brogden’s assumptions to classification, we find that the 
value of a test used to assign persons to one of two treatments is 
proportional to its differential validity. Differential validity is expressed by the 
formula 

$$ s_1 r_{1t} - s_2 r_{2t} $$ 

Here $s_1$ and $s_2$ are the standard deviations of criterion scores for the two 
treatments for randomly selected men, the two criteria being expressed in 
comparable units such as dollar value of the worker’s production. $r_{1t}$ and 
$r_{2t}$ are the usual predictive validity coefficients for test t (Brogden, 1951). 
Looking back at Table 48 (p. 341), let us assume that the criterion 
standard deviations for pilots and navigators are about equal; that is, that the 
difference in value to the Air Force between an ace pilot and a borderline 
pilot is equal to that between an outstanding and a mediocre navigator. We 
see that the Two-Hand Coordination Test has a validity of .26 for navigator 
and .30 for pilot. It therefore has almost no differential validity. Numerical 
Operations has a validity of .26 for navigator and .01 for pilot. It is therefore a good 
classification test. The Mathematics test, with validities .50 and .08, is even 
better. 
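
The comparison just made can be reproduced numerically. The sketch below 
applies the differential validity formula to the three pairs of coefficients 
quoted above, under the text's assumption of equal criterion standard 
deviations (set to 1.0 here for convenience). The function and variable names 
are invented for illustration. 

```python
def differential_validity(s1, r1t, s2, r2t):
    """Brogden's index for a test t used to choose between two treatments."""
    return s1 * r1t - s2 * r2t


# (navigator validity, pilot validity) as quoted in the text from Table 48.
validities = {
    "Two-Hand Coordination": (0.26, 0.30),
    "Numerical Operations": (0.26, 0.01),
    "Mathematics": (0.50, 0.08),
}

# Equal criterion standard deviations, as the text assumes.
s_navigator = s_pilot = 1.0

# Rank tests by the magnitude of the index; the sign only shows which job is favored.
ranking = sorted(
    ((name, abs(differential_validity(s_navigator, nav, s_pilot, pilot)))
     for name, (nav, pilot) in validities.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, value in ranking:
    print(f"{name}: {value:.2f}")
```

The ordering agrees with the discussion: Mathematics is the strongest 
classifier, Numerical Operations next, and Two-Hand Coordination, despite its 
useful validity for both jobs, contributes almost nothing to the choice 
between them. 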

One of the remarkable values of differential predictors is that they make 
much better use of a pool of manpower than can a general predictor. 
Suppose we have three tests, A, B, and C. Test A is a general test which has 
validity .40 for job 1 and job 2. If we want to rule out below-average 
performers, we accept the best 50 percent of the men. We must divide them 
randomly between the two jobs because test A has no differential validity. 
Test B has validity .40 for job 1, .00 for job 2, and zero correlation with test 
C. Test C has validity .00 for job 1, and .40 for job 2. Now we can accept all 
men above average on either B or C and assign each one according to which 
test he does better on. With these differential tests, 75 percent of the men 
can be used, yet each one is average or better in aptitude for the job in which 
he is placed. 

FIG. 66. Superior use of manpower by means of differential predictors. 
(Panels: Single Predictor; Differential Predictors.) 
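
The figure of 75 percent follows from the independence of tests B and C: a man 
is rejected only if he is below average on both, which happens one time in 
four. The simulation below is a rough check under assumptions added here 
(normally distributed, uncorrelated scores; a hypothetical sample of 20,000 
men; invented names), not a procedure described in the text. 

```python
import random


def place_men(n_men=20000, seed=7):
    """Accept men above average on B or C; assign each by his better test."""
    rng = random.Random(seed)
    placed = []
    for _ in range(n_men):
        b = rng.gauss(0, 1)  # score on test B (predicts job 1 only)
        c = rng.gauss(0, 1)  # score on test C (predicts job 2 only)
        if b > 0 or c > 0:
            job = 1 if b >= c else 2
            # Keep the score on the test relevant to the assigned job.
            placed.append((job, b if job == 1 else c))
    return placed, len(placed) / n_men


placed, fraction_used = place_men()
print(f"fraction of men used: {fraction_used:.3f}")
print("all placed men above average on their own test:",
      all(score > 0 for _, score in placed))
```

The simulated fraction should fall near the theoretical value 1 - (.5)(.5) = 
.75, compared with the 50 percent usable under the single general test. 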

All decisions in clinical diagnosis are essentially classification decisions, 
i.e., choices between treatments. Even discharging a patient is a decision 
that return to the community is the most beneficial treatment for him. There 
are usually no quotas to be filled in clinical diagnosis; every person can be 
called “normal” or every one called “schizophrenic” if such uniform 
classification appears correct. Meehl and Rosen (1955) have drawn attention to the 
fact that such uniform classification is often the best strategy even when we 
are using a test which has significant positive validity. This occurs when the 
clinician is trying to identify a rare condition. Meehl and Rosen take the 
problem of predicting suicide as an example. If a test identifies a person 
with a high probability of suicide, the clinician will probably recommend 
that he be given closer attention and more intensive treatment than a 
probable nonsuicide. Suicide is rare; perhaps 5 percent of those tested in a 
certain clinic will later attempt suicide. A person with a low test score (on a 
hypothetical test) may have expectancy of only 0.1 percent of a suicide 
attempt, and one can confidently place him in the nonsuicide category. With 
higher test scores, probability of suicide increases, so that the test has 
undoubted validity. The highest score in the clinic sample, however, may 
indicate only a 20 percent expectancy of suicide. These people cannot be called 
probable suicides. If we so diagnosed them, we would make four errors (false 
positives) for every correct decision. The special care appropriate for a 
probable suicide is a great drain on the resources of a clinic. It may not be 
able to invest this effort in four false positives in order to prevent one suicide. 
To be sure, the clinic may argue that one person saved far outweighs the cost 
of guarding all five, in which case persons with the highest test score will 
be placed in the risk-of-suicide category. The principle still holds: a 
classification test with a positive validity is not worth using if the cost of false 
positives outweighs the benefit from hits. 

To show that a test is beneficial, it is necessary to estimate the goodness 
of the decisions it leads to. A positive validity coefficient alone is not enough 
to demonstrate practical usefulness for institutional decisions. 
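
The arithmetic behind this argument can be laid out as a small calculation. 
The sketch below is an illustration under stated assumptions: benefits and 
costs are expressed on one arbitrary common scale, the benefit of a hit is 
measured relative to leaving the same person in the ordinary category, and the 
particular utility figures are hypothetical. Only the 20 percent expectancy 
comes from the example above. 

```python
def false_positives_per_hit(p_condition):
    """Expected number of false positives for each correct identification."""
    return (1.0 - p_condition) / p_condition


def net_benefit_of_flagging(p_condition, benefit_per_hit, cost_per_false_positive):
    """Expected gain per flagged person, compared with flagging no one."""
    return (p_condition * benefit_per_hit
            - (1.0 - p_condition) * cost_per_false_positive)


# Highest-scoring group in the example: 20 percent expectancy.
print(round(false_positives_per_hit(0.20), 2))

# Hypothetical utilities: a hit worth 3 units, a false positive costing 1 unit.
low_value = net_benefit_of_flagging(0.20, benefit_per_hit=3.0,
                                    cost_per_false_positive=1.0)
# Hypothetical utilities: a hit worth 10 units, a false positive costing 1 unit.
high_value = net_benefit_of_flagging(0.20, benefit_per_hit=10.0,
                                     cost_per_false_positive=1.0)
print(low_value > 0, high_value > 0)
```

Under the first set of assumed utilities, flagging the high scorers does more 
harm than good even though the test is valid; under the second, where one 
person saved is judged to outweigh the cost of guarding all five, flagging is 
worth while. 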

31. Suppose that pilot performance is judged to be three times as important as 
navigator performance ($s_1 = 3s_2$). Then what tests have greatest differential 
validity for these jobs? 

32. What tests have greatest differential validity for distinguishing pilots from 
bombardiers? 
33. A twenty-point test for parole prediction gives these expectancies of violating 
parole: for a score of 20, 40 percent; score 10, 20 percent; score 0, 5 percent. 
Can this test be used practically by a parole board, or should all prisoners be 
classified as likely to obey probation rules? 

Suggested Readings 

Flanagan, John C. The critical incident technique. Psychol. Bull., 1954, 51, 
327-358. 

Procedures used to obtain and interpret critical incidents are described, 
together with suggestions for using the information in measurement and 
training. 

Ghiselli, Edwin E., & Brown, Clarence W. Analysis of jobs. Personnel and 
industrial psychology (2nd ed.). New York: McGraw-Hill, 1955. Pp. 17-58. 

The authors survey and critically compare methods of job analysis which may 
be used in deciding what tests deserve tryout. 

Kirchner, Wayne K., & Dunnette, Marvin D. Applying the weighted application 
blank procedure to a variety of office jobs. J. appl. Psychol., 1957, 41, 206-208. 

A simple experiment shows how a score derived from personal history can 
predict job tenure. 

Perrine, Marvyn W. The selection of drafting trainees. J. appl. Psychol., 1955, 
39, 57-61. 

A compact report of a selection study on a small scale illustrates nearly all the 
principles and problems of selection research. This technically excellent 
investigation was done as an undergraduate honors project. 

Tiffin, Joseph, & McCormick, Ernest J. General principles of personnel testing. 
Industrial psychology (4th ed.). Englewood Cliffs, N. J.: Prentice-Hall, 1958. 
Pp. 75-109. 

Procedures used in validating tests for industrial selection are described. 
Particular emphasis is placed on the difference between studies on present 
employees and studies conducted on new applicants, tested at the time of 
hiring but not screened on the basis of test performance. The importance of the 
selection ratio as a factor determining the usefulness of a test is fully explained. 


Proficiency Tests 


PROFICIENCY tests measure present performance in some important task. 
A test of typing proficiency evaluates job applicants; a test of driving 
proficiency indicates whether a person meets standards for a license; a test of 
proficiency in calculus determines the student’s course mark. The proficiency 
test can be thought of as a sample of the criterion. In this it differs from the 
usual test of general or special ability, which presents novel problems quite 
different in form from the criterion. 

Functions of Proficiency Tests 

Proficiency tests may be used for forward-looking decisions such as selection 
and classification and also for evaluation decisions which look back at some 
completed experience. Tests which look forward are often called “aptitude” 
tests, and those used to measure educational gains are called 
“achievement” tests. Items measuring proficiency (in, for example, reading and 
arithmetic) are used in both. 

Measures of proficiency, being evidence of past attainment, are among the 
best predictors of future academic attainment. In the California study of 
engineering (p. 321), the best predictors were high-school record, a 
mathematics test, and a test of proficiency in interpreting scientific data. In the 
DAT guidance battery, the two Language Usage tests and the Numerical 
Reasoning test are predictors which measure school-learned abilities. 

Proficiency tests for hiring employees are quite varied, since each test 
must deal with the skills or knowledges of a particular job. Tests for specific 
jobs can be standardized whenever the requirements of jobs in several 
places are similar. A few such tests for clerical skills and shop knowledge 
have been published. The wide range of tests usable in hiring and 
promotion is illustrated by the list of tests used by Macy’s department store 
(Table 50). The variety of tests and the adaptation of the tests to the specific 
demands of jobs are noteworthy. We have also pointed earlier to the use of 
information tests in selection of college students (as in the College 
Qualification Tests) and in screening of skilled workers (trade tests, p. 283). An 
“automotive information” test is a valuable supplement to aptitude tests for 
classifying soldiers into artillery and armor duties (Birnbaum et al., 1957). 

TABLE 50. Tests Used in Employee Selection by R. H. Macy & Company, Inc. 

Category          Test                             Remarks 
----------------  -------------------------------  ------------------------------------ 
General ability   Wonderlic adaptation of Otis     For nonexecutives 
                  Self-Administering 
                  Mental ability                   Difficult one-hour test for 
                                                   prospective executives 
Arithmetic        Mental arithmetic                For cashiers, adjusters, and various 
                                                   clericals 
                  Sales arithmetic                 For Telephone Order Board operators 
Office skills     Thurstone typing                 Rough draft typing and tabulation 
                  Junior typing                    Simple copying 
                  Comptometry                      Various degrees of skill on the 
                    1 addition                     comptometer 
                    2 addition and subtraction 
                    3 all operations 
                  Stenography                      Three letters are dictated and 
                                                   transcribed 
Language ability  Spelling                         For typists, correspondents, and 
                                                   secretaries 
                  Correspondent's test             Worksample; subject composes three 
                                                   letters answering sample inquiries 
                                                   from customers. Graded by 
                                                   prospective supervisors on 
                                                   composition and grammar 
Clerical ability  Speed and accuracy               Measures ability to count or locate 
                                                   numbers and letters 
                  Same-different                   Comparisons of written and printed 
                                                   names 
Manual ability    Minnesota Rate of Manipulation   For kitchen employees, packers, 
                                                   stockmen 
Color vision      Pseudo-Isochromatic Plates for   For guards and stockmen handling 
                  testing color perception         colored merchandise (e.g., men's 
                                                   ties) 

Source: Information supplied by Mrs. Iseli Krauss of Macy’s Personnel Division. All tests except 
the Wonderlic, Thurstone, Minnesota, and color vision test were specially constructed by Macy's. 

While the proficiency test has an obvious place in selection and guidance, 
its unique function is in evaluation of treatments. It is our best means of 
measuring the effectiveness of instruction. The teacher and student may 
think of the test primarily as a basis for assigning fair marks, but its more 
important function is to indicate what the student is ready to do next, and to 
tell the teacher how well his instructional methods have succeeded. 
Tests are such a common fixture in the schools that it is difficult to 
appreciate their influence, which can be more clearly seen in a situation where 
standardized testing was suddenly introduced. The Navy has long operated 
a vast and successful program of vocational education. Navy recruits receive 
training in the various skills needed to maintain a complex fighting force. 
When educators and psychologists were given responsibility for some Navy 
schools during World War II, one major change they proposed was the 
development of standard proficiency tests. The advantages the Navy found 
may be summarized as follows (Stuit, 1947, pp. 287-354): 

• Tests aided in holding instruction constant in all schools preparing men 
for a given duty (e.g., torpedomen). Although objectives of instruction 
and curricula were standardized, individual schools tended to neglect or 
overemphasize particular topics. Since any neglect would lower test scores, 
the tests forced teachers to do as the course planners intended. 

• The tests provided a basis for revising the curriculum and improving 
instruction. If results showed that certain skills were not mastered in the 
time allotted, it was necessary to reconsider the length of the course or the 
emphasis placed on these skills. Without such tests, instructors often 
assumed that, having “covered” a topic, they had taught it. Test results were 
of great interest to instructors and often caused them to ask supervisors or 
specialists to suggest ways to improve their teaching. 

• Proficiency scores identified classes which were not making adequate 
progress, so that supervisors could investigate the cause. 

• Some tests required the student to demonstrate job skills rather than 
merely to give verbal answers. Such tests directed the attention of 
instructors to the behaviors the course was intended to produce. Reliance on the 
lecture-discussion method of teaching declined, and training improved. 

• Tests which placed emphasis on all significant aspects of the course 
made sure that all-round proficiency would be considered and deficiencies 
noted. In the Basic Engineering school, before the introduction of standard 
testing, grades were strongly influenced by performance in mathematical 
aspects of the course, probably because this ability was easy to test. The new 
examinations stressed mechanical understanding. As a result, attention was 
drawn to men who, despite skill in arithmetic, were poor in other essentials. 

• In the absence of tests, grades had been assigned to students on the 
basis of subjective impressions, with, at best, the aid of teacher-made tests. 
Such marks were unreliable and subject to bias. Even objective tests, made 
by teachers without special training, measured poorly. Careful preparation 
of tests led to fairer and more accurate grading. 

• Standard tests permitted analysis of grading standards and reduced 
variation between graders. When marks were based solely on standard tests, 
grades given at different places represented the same degree of proficiency. 

• More accurate final marks were a better criterion against which to 
validate selection tests. 

• Motivation of students and instructors was improved by developing 
rivalry based on a fair standard. Showing a man his particular deficiencies 
was useful for motivating and directing study. 

In effect, the program of proficiency testing introduced into personnel 
management “quality control” like that imposed on manufacturing processes. 
Substandard individuals were thrown back for further polishing, or for 
discard. Substandard teaching methods were detected and changed. These 
advantages have their counterparts in industrial training and in schools and 
colleges. Marks are unreliable, and may emphasize some aspects of the 
course to the exclusion of others. Different instructors in the same 
department grade differently and teach with different effectiveness. Teachers 
depart from planned curricula. But stringent control by testing can be 
undesirable in general education, for reasons to be discussed later in this 
chapter. 

One significant contribution of standardized tests has been to break down 
the “time-serving” concept of education. A person’s standing in school is 
frequently judged by the number of years he has put in, or the number of 
courses he has passed through. Time spent is no index of education 
received. In one study, where thousands of college students took standardized 
tests of knowledge in various fields, many college seniors knew less than 
the average high-school senior. Since number of units accumulated tells 
little about proficiency, tests are being given increasing weight as evidence 
of educational development. In most communities, an adult who did not 
complete high school may receive a diploma by passing an examination and 
use this diploma to enter college if he wishes. Colleges exempt from certain 
required courses those who perform well enough on proficiency tests. Control 
by proficiency examination is widespread in professional education. 
Lawyers, for example, must take a state examination before being admitted to 
practice. Psychologists wishing a diploma certifying their competence as 
clinical, industrial, or counseling psychologists take an examination given 
by the American Board of Examiners in Professional Psychology (ABEPP). 

In this chapter we can introduce only the major problems of proficiency 
testing and illustrate a few techniques that have been used. The psychologist 
and the teacher frequently have to construct proficiency tests. This is an art 
which requires both experience and technical training. For advice on 
construction of tests for various purposes, the reader should consult such 
specialized books as these: 

Dorothy C. Adkins and others, Construction and Analysis of Achievement Tests, 
Washington, Government Printing Office, 1947. This discusses test construction 
from the point of view of the Civil Service Commission. 
E. F. Lindquist (ed.), Educational Measurement, Washington, American Council 
on Education, 1951. This major handbook on test construction covers both 
procedures and theory. 
J. R. Gerberich, Specimen Objective Test Items, New York, Longmans, Green, 
1956. This book discusses the use of objective tests in various fields and gives 
over 200 examples of test items developed for special purposes. It also has 
excellent bibliographies. 

1. In the United States, a high-school diploma is awarded to almost any pupil who 
has stayed in the school system for twelve years. In Great Britain, where fewer 
persons complete secondary education, the evidence of “completion” is the 
General Certificate of Education, granted not by the school but by a regional 
examining body controlled by the universities. Only pupils who meet the passing 
standard on a test are given the certificate. What assumptions about society 
underlie each plan? What are the social consequences of each plan? 
2. What consequences would follow if ABEPP published the proportion of 
applicants trained by each university who pass the examination for their diploma in 
clinical psychology? 


VALIDITY OF PROFICIENCY TESTS 
Content Validity 

Among the four types of validity introduced in Chapter 5, little has yet 
been said about content validity. Content validation is primarily relevant to 
proficiency testing. General and special ability tests, for the most part, 
employ one type of content to assess ability to learn to deal with some other 
content. The typical proficiency test, on the other hand, assesses ability to 
deal with content of which the test is supposed to be a sample, and its 
content validity must be established. 

Whereas predictive and concurrent validation judge a test by statistical 
study of results, content validity is established by logical examination of the 
test and the methods used in its preparation. The question is, How well does 
performance on the test serve as an index of performance on some defined 
“universe of situations”? The test questions are only a sample of all the 
possible questions that might be asked, and they may or may not be 
representative of the total domain of appropriate questions. 

Ideally the author of a test would define a universe to be measured and
then sample his items so as to represent that content. To specify the universe,
he has to define both the stimuli and the responses that concern him.
Consider first the stimuli. Each of the following describes a universe of
content in which some tester might be interested:

• All the flags used in the U.S. Navy signal system.

• All the words likely to be read in everyday German, i.e., in newspapers,
correspondence, etc.

• All possible addition problems involving two numbers, each of three
digits or less.

• All facts regarding schizophrenia given in a certain textbook.

To specify the response he intends to observe, the test developer would
indicate what he desires the subject to do. Is the subject to name the flag,
taking as long as he needs? Or is he to recognize it rapidly? Is he to tell what
a German word means when he hears it? Or to recall the German word
when given the English equivalent?

When an author has defined a universe of content, he then can prepare
a sample to represent that universe. For instance, he might tabulate all the
words used in German newspapers and use this as his basic list. A random
sample could be taken, perhaps every 200th word on the list. Alternatively,
the sample could be chosen on a representative basis rather than randomly. The words
might be grouped according to frequency of use: the thousand most frequent,
the next thousand, and so on. Then twenty words might be taken at
random from each level. The resulting sample would have an average frequency
of use similar to that of the universe. The representative sample generally
gives a somewhat more accurate measure than the random sample of
equal size.
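
The two sampling plans just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list is a hypothetical placeholder (5,000 made-up entries ordered from most to least frequent), and the function names are invented for the example. The step of 200, the bands of one thousand, and the twenty words per band are the figures given in the text.

```python
import random

def every_nth_word(words, step=200):
    """Take every `step`-th word from the frequency-ordered list."""
    return words[step - 1::step]

def sample_by_frequency_band(words, band_size=1000, per_band=20, seed=0):
    """Draw `per_band` words at random from each successive frequency band."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = []
    for start in range(0, len(words), band_size):
        band = words[start:start + band_size]
        sample.extend(rng.sample(band, min(per_band, len(band))))
    return sample

# Hypothetical basic list: placeholder "words", most frequent first.
word_list = [f"word{rank}" for rank in range(1, 5001)]

every_200th = every_nth_word(word_list)        # 25 words
by_band = sample_by_frequency_band(word_list)  # 20 words from each of 5 bands
print(len(every_200th), len(by_band))
```

Because each band contributes the same number of words, no level of frequency can be over- or under-represented by the luck of the draw, which is the sense in which the text calls this sample representative.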

Formal sampling plans are most used to select items for educational
tests. Spelling words, arithmetic combinations, shorthand symbols, and
other collections of factual associations can be catalogued and sampled. In
subjects like history and science the content cannot be reduced to a list of
specific items, but it is still possible to sample so as to represent each section
of the course in proper proportion.

Sampling is sometimes very poor in tests developed by an inexperienced
or untrained tester. A spelling test may consist of the words someone believes
workers should know, rather than words actually used on a job or actually
covered in a course of study. A test in physics may overemphasize items on
the parallelogram of forces if the tester finds such items easy to invent, and
may neglect topics where he lacks good ideas for items. Competent test developers
take great care to match their proficiency tests to a careful job
description or to the course of study.

If a test is prepared according to a clearly described sampling plan, a
prospective user can judge content validity very simply. He needs only to
decide whether he is satisfied with the author’s choice of universe and his
sampling method. If a German teacher is interested in preparing his students
to read everyday German, he should be content with a test based on newspaper
vocabulary. If the course is intended to teach literary German or
scientific German, the same test would be less appropriate. The teacher can
adopt it only with some risk of drawing false conclusions. The student who
has the largest scientific vocabulary may not earn an especially high score
on the test of newspaper vocabulary.

Although sampling items from a definite universe is a pleasing and logical
ideal, very few of the situations that concern testers reduce to such simple
terms. Most often the test constructor has some general idea of the stimulus
he wishes to place before the subject, but he cannot give a neat definition.
Can an investigator who wants to measure sociability of preschool children
catalog the situations where sociable behavior arises? Can we list all the
human relations problems a foreman should be able to deal with? Can one
define the universe of situations in which scientific reasoning is to be shown
by a science student? Obviously not.

Examining content validity therefore requires judging whether each
item, and the distribution of items as a whole, covers what the tester wants
to measure. This judgment rests on the test user rather than on the test author.
The test author can state the sources of his items, but they will rarely
correspond perfectly to what the tester intended to measure. How close a
correspondence should be demanded is a highly subjective judgment, unless
there happens to have been a correlational study showing whether one performance
can substitute for something a bit different.

3. A tester wants to measure attitude toward the Negro, so as to compare schools
having different programs of student activities. What is the universe from which
he will draw test items?

4. Skill in the use of library reference materials is to be measured at the end of a
college freshman course. Define a possible universe of situations from which the
test might be drawn.

5. The Morse code consists of a short alphabet of characters. The receiver must
respond to units made up of several characters in rapid succession; the most
difficult part of the task may be to separate one letter from the next.

a. Describe an appropriate test for a person learning to receive ordinary nonsecret
communication in English.

b. Describe an appropriate test for a person learning to receive secret (encoded)
messages of the form GFVG JHBI YGTA FBSJ. . .

6. In a written test for drivers, how could the tester decide how many questions to
devote to speed laws, how many to safety rules, how many to interpretation of
signals, etc.? Would the decision reached have any effect on a learner’s chance
of success? Would it have any effect on the way he studies for the test?


Statistical Selection of Items 

Often the test constructor supplements logical procedures with a statistical
item analysis. He gives a trial form of the test, separates groups of students
who do well and poorly on the test as a whole, and compares these two
groups item by item. An item on which the good students surpass the poor
ones is judged satisfactory. One which shows no difference or on which the
poor group is more successful is regarded as questionable. A discrimination
index or a correlation between item score and test score may be calculated
to formalize this comparison.
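
As a rough sketch of how this comparison might be formalized, the Python below computes a simple discrimination index (proportion passing in the high group minus proportion passing in the low group) and an item-test correlation. The scores are invented for illustration, and splitting the class into upper and lower halves is an assumption of this sketch; the text does not say how large the two groups should be.

```python
from statistics import mean, pstdev

def discrimination_index(item_scores, total_scores, fraction=0.5):
    """Proportion passing the item in the high group minus that in the low group.

    `fraction` is the share of examinees placed in each extreme group
    (0.5 = upper half vs. lower half; an assumption, not from the text).
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:k], order[-k:]
    return mean(item_scores[i] for i in high) - mean(item_scores[i] for i in low)

def item_test_correlation(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    mi, mt = mean(item_scores), mean(total_scores)
    si, st = pstdev(item_scores), pstdev(total_scores)
    if si == 0 or st == 0:
        return 0.0  # everyone passed (or failed): the item cannot discriminate
    cov = mean((a - mi) * (b - mt) for a, b in zip(item_scores, total_scores))
    return cov / (si * st)

# Hypothetical trial data: total scores of ten students and two items (1 = pass).
totals = [10, 12, 15, 18, 20, 22, 25, 27, 28, 30]
sound_item = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]     # good students surpass poor ones
doubtful_item = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # unrelated to total score

d_sound = discrimination_index(sound_item, totals)
d_doubtful = discrimination_index(doubtful_item, totals)
r_sound = item_test_correlation(sound_item, totals)
```

With these made-up figures the first item yields a clearly positive index, while the second yields a slightly negative one and would be flagged as questionable.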

An item shows a low item-test correlation (i.e., fails to discriminate good
from poor students) for one of three reasons:

• It is so easy that nearly everyone passes, or so hard that nearly everyone
fails.

• It is ambiguous or confusing.

• It measures something different from what the bulk of the test measures.

Some test constructors routinely discard any item which discriminates
poorly. In a test purporting to sample a defined body of content, it is highly
undesirable to drop items just because their correlations are low.

Easy and hard items are needed if the test score is to give a fair picture of
the proportion of the content a person knows. If the easy items were eliminated,
the average percentage score on the test would be much lower than
that on a random sample of the content. If items on a certain topic prove to
be very hard, this may be important information for the instructor, and the
items should perhaps be left in the test just to bring this fact home to him.

The chief value of the discrimination index is to point out ambiguities. A
badly written item which confuses good students has no place in the test.
The test constructor who examines an item with an unsatisfactory index may
see a flaw in the item which explains the errors of good students: a double
negative, a supposedly true statement to which the critical reader can find an
exception, a too-plausible alternative answer, etc. If such a flaw can be
found, the item should be rewritten. Dropping a particular item probably
will not spoil the content validity of the test. The danger is that many of the
poorly constructed items will fall in the same content area. When they are
dropped the test loses its representativeness.

The third cause of a low discrimination index is that the item differs psychologically
from the bulk of the test. The content that a proficiency test is
intended to sample is ordinarily mixed. Dropping unusual items “purifies”
the test in one sense, but it no longer represents the original universe of
content. A person might master the verbal portions of chemistry and still
be badly confused on the quantitative parts of the course (such as balancing
equations). To drop the quantitative sections just because they correlate
less with the total than do verbal items makes the test a false sample of
the content. On the other hand, if a question correlates poorly with the total
because it requires knowledge of a certain compound that few pupils have
read about, the item ought to be replaced. The special element it brings in
is irrelevant to the purpose of the test.

Matching Achievement Tests to the Curriculum 

If a test is used to evaluate instruction, it is necessary to compare the content
of the test with the objectives of the instruction. A test might have excellent
content validity as a measure of skill in arithmetic computations and
yet be a most unfair evaluation of instruction if the course was primarily
designed to develop reasoning abilities. Brownell and Moser (1949) wanted
to know which of four ways of teaching subtraction was best. Some pupils
were taught by the equal additions method and others by the decomposition
or borrowing method. Each method was presented meaningfully in some
classes and as a mechanical or rote procedure in other classes. Subtraction
of two-digit numbers (e.g., 27 from 41) was the content used in the instruction.
On a test of accuracy in two-digit problems there was a small advantage
for the borrowing method, meaningfully taught, but the differences between
groups were so small that one might advise a teacher to use whichever
method he happened to prefer. Arguing that an important purpose of beginning
instruction in subtraction is to pave the way for later complicated
problems, the investigators also tested the children on subtracting three-digit
numbers (e.g., 358 from 644) even though this had not been taught.
On this test the borrowing-meaningful group did 50 percent better than the
other groups; this was unquestionably the best teaching procedure.

To take into account the variables that ought to be measured, one has to
find out just what the objectives of instruction are. Objectives do not end
with factual knowledge and a limited group of skills. When you ask a
teacher why he is teaching his course, he lists a large number of objectives.
The geometry teacher, for example, denies that the purpose of his course
is to transmit a certain number of theorems, a few practical principles, and
some skills with ruler and compass. Instead, he speaks of developing habits
of reasoning, skill in identifying assumptions, skepticism about unverified
conclusions, and so on. Yet achievement tests have traditionally stressed
specific facts and skills to the exclusion of other important outcomes of the
course.

In a series of studies of college and high-school teaching (Smith and
Tyler, 1942), R. W. Tyler and his coworkers identified a large number of
purposes schools claimed to hold. These aims may be grouped as follows
(Raths, 1936):

• Functional information. Not mere rote knowledge, but knowledge that
can be applied to new situations where it is relevant.

• Thinking skills and habits.

• Attitudes and social sensitivity. (Tolerance, spirit of scientific inquiry,
appreciation of music, etc.)

• Interests, aims, purposes. (Vocational goals, secretarial interests, etc.)

• Study skills and work habits.

• Social and personal adjustment.

• Creativeness.

• Physical health.

• A functional philosophy of life.

Before one can determine how well the student has developed these
qualities, it is necessary to define them in terms of behavior. How does a
person act who has “ability to draw sound conclusions from scientific data”?
Skill in interpretation of data is shown by certain definite actions. If we give
a skilled person a table or graph carrying unfamiliar information, he does
certain things. He identifies major trends. He disregards fluctuations that
are due to variations in sampling. He concludes that factor A and factor B
change together, not that factor A causes factor B. Having defined precisely
what actions give evidence that the subject possesses the desired skill, it is
a fairly simple matter to observe those actions and translate the observations
into a measurement.

Facts and skills loom so large in the usual classroom that teachers and test
designers have emphasized them out of proportion to other types of outcome.
Although it is important to measure knowledge and skill, a pupil may
earn a high score on memorized material and yet have made little progress
toward understanding the course. A course in cooking presumably is supposed
to improve ability to cook. But one teacher tested a college class on
knowledge of scientific principles underlying cookery, and also had them
cook food. The quality of cooking correlated only .25 with the verbal knowledge
(Arny, 1953, p. 25).


TABLE 51. Improvement in Abilities in Zoology, Measured at the End of the Course
and One Year Later

                                                        Mean Score             Percent of Gain
                                                 Beginning  End of   One Year  During Course Which
Type of Examination Exercise                     of Course  Course   Later     Was Later Lost

Naming animal structures pictured in diagrams       22        62       31            77
Identifying technical terms                         20        83       67            26
Recalling information
  a. Structures performing functions in type forms  13        39       34            21
  b. Other facts                                    21        63       54            21
Applying principles to new situations               35        65       65             0
Interpreting new experiments                        30        57       64           -25

Source: R. W. Tyler, 1934, p. 78.

Studies of forgetting give weight to the argument that thinking and attitudes
should be measured. Facts poorly understood are quickly dropped
from the mind, whereas attitudes and changes of thinking habits are usually
much more lasting. Tyler gave a series of tests to one college class before
they studied zoology, at the end of the course, and again after a second
year in which they studied no zoology. The most lasting changes were in
ability to apply principles to new problems and to draw conclusions from
data (Table 51).
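
The last column of Table 51 can be approximately recomputed from the three mean scores. The sketch below assumes that "percent of gain lost" means the drop from end of course to one year later, divided by the gain from beginning to end of course; the dictionary keys are shortened labels invented for the example. Results computed this way differ by a point or two from some published entries, presumably because the published means are rounded.

```python
def percent_of_gain_lost(beginning, end, year_later):
    """Share of the course gain that had disappeared a year later, in percent."""
    gain = end - beginning
    lost = end - year_later
    return 100.0 * lost / gain

# Mean scores from Table 51: (beginning of course, end of course, one year later).
table_51 = {
    "naming structures": (22, 62, 31),
    "technical terms": (20, 83, 67),
    "recall: structures and functions": (13, 39, 34),
    "recall: other facts": (21, 63, 54),
    "applying principles": (35, 65, 65),
    "interpreting experiments": (30, 57, 64),
}

computed = {name: percent_of_gain_lost(*scores) for name, scores in table_51.items()}
for name, value in computed.items():
    print(f"{name:35s} {value:6.1f}")
```

A negative value, as for interpreting new experiments, indicates that the class scored higher a year later than at the end of the course.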

In stating objectives and in designing tests it is especially important to distinguish
ability (maximum level of performance) from typical performance.
Knowing the right answer is no guarantee that a person will behave in the
right way. It is easy, for example, to prepare a true-false test for a course in
“how to study.” After a few lectures on principles of study, most students
know how they should study and can pass the test. But the gap is wide between
what students know about study and what they do about it. Tests of
typical behavior are needed to evaluate the effectiveness of courses teaching
handwriting, leadership or personnel management, resistance to propaganda,
accuracy in arithmetic, and many other objectives. Proficiency tests
measure abilities produced on demand. To evaluate instruction fully, it is
necessary to supplement proficiency tests with observations and other measures
of typical behavior (see Part Three).

7. For one of the following courses, try to list all the important objectives:

a. Study of literature in the junior high school.

b. A course to train union officials for collective bargaining.

c. A course to train junior executives in human relations.

8. Define each of the following objectives in terms of specific behaviors:

a. “To train young people for wise parenthood.”

b. “To increase appreciation of good literature.”

c. “To prepare young people for the duties of citizenship.”

9. The Brownell-Moser test of ability to do a performance that had not been taught
is a “transfer” test.

a. Is it fair to judge the student’s learning by asking what he has not studied?

b. A man is trained to repair certain models of a radar set. What would be a
suitable transfer test and what would be learned from it?

c. It is claimed that study of French improves English. Would it be useful to include
an English test in a research study on a new method of teaching
French?


Construct Validity 

The listing of various objectives implies a list of distinct kinds of behavior.
Are “thinking skills” really distinct from “functional information”? This is a
problem of construct validation.

Tyler’s evidence that applying principles and interpreting experiments
are much less subject to forgetting than factual information indicates that
these abilities are distinct. In a more extensive study of fourteen college
courses he found that the correlations between different types of proficiency
were quite small. Even after correction for errors of measurement, the correlations
were (Judd et al., 1936):

Knowledge of facts vs. application of principles, about .45
Knowledge of facts vs. inference from experiment, .35
Application vs. inference, .40

10. In a certain course, the correlation between a factual test and a test of ability
to apply knowledge to new situations is .40. Assume that grades are assigned
as follows: 10 percent A, 20 percent B, 40 percent C, 20 percent D, and 10
percent F. What grades will A students on the first test receive if the second
test is used as a basis for grading? (Use the scatter diagram on p. 114.)

Effects of Item Form on What Is Measured. Tests having the same “content”
may measure different abilities because of variables associated with
item form. Reading ability, for example, affects scores on almost all achievement
tests. A valid measure of knowledge is not obtained if a person who
knows a fact misses an item about it because of verbal difficulties. The Navy
Mechanical Knowledge Test contained four types of item: mechanical facts,
tested verbally; mechanical facts, tested pictorially; electrical facts, tested
verbally; and electrical facts, tested pictorially. Similarity of content produced
lower correlations than similarity in form (Table 52). In other words,
the form of the items largely determined the score received. Another study
provides even stronger evidence that the verbal element in tests may be undesirable.
Training of Navy gunners had been validly evaluated by scores
made in operating the guns. As an economical substitute, verbal and pictorial
tests were developed. Identical information was tested in the two forms,
the same question being asked in words alone or by means of pictures
supplemented by words. Questions dealt with parts of the gun, duties of the
crew, appearance of tracers when the gun was properly aimed, etc. The
pictorial test had a correlation of .90 with instructors’ marks based on gun
operation whereas the validity of the verbal test was only .62. The verbal
test was in large measure a reading test; it correlated .59 with a Navy reading
test, while the picture test correlated only .26 with reading (Training
Aids Section, 1945).


TABLE 52. Correlations of Tests Having Similar Form and Tests
Having Similar Content

                                                               Correlation
                                                               Corrected for
                                                 Correlation   Unreliability

Tests similar in form, different in content
  Verbal tests: mechanical vs. electrical            .63            .79
  Pictorial tests: mechanical vs. electrical         .64            .86
Tests similar in content, different in form
  Mechanical: verbal vs. pictorial                   .61            .71
  Electrical: verbal vs. pictorial                   .51            .74
Tests different in both form and content
  Mechanical verbal vs. electrical pictorial         .49            .63
  Electrical verbal vs. mechanical pictorial         .45            .59
Kuder-Richardson reliability coefficients
  Mechanical verbal                                  .89
  Mechanical pictorial                               .82
  Electrical verbal                                  .71
  Electrical pictorial                               .67

Source: Conrad, 1944.
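
The "corrected for unreliability" column of Table 52 appears to follow from the raw correlations and the Kuder-Richardson coefficients by the usual correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The source does not state the formula, so the sketch below treats it as an assumption; the variable names are invented for the example.

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two tests if both were perfectly reliable."""
    return r_xy / sqrt(r_xx * r_yy)

# Kuder-Richardson reliabilities from Table 52.
reliability = {
    "mech_verbal": 0.89,
    "mech_pictorial": 0.82,
    "elec_verbal": 0.71,
    "elec_pictorial": 0.67,
}

# Observed correlations from Table 52.
observed = {
    ("mech_verbal", "elec_verbal"): 0.63,        # same form
    ("mech_pictorial", "elec_pictorial"): 0.64,  # same form
    ("mech_verbal", "mech_pictorial"): 0.61,     # same content
    ("elec_verbal", "elec_pictorial"): 0.51,     # same content
    ("mech_verbal", "elec_pictorial"): 0.49,     # both differ
    ("elec_verbal", "mech_pictorial"): 0.45,     # both differ
}

corrected = {
    pair: round(correct_for_attenuation(r, reliability[pair[0]], reliability[pair[1]]), 2)
    for pair, r in observed.items()
}
for pair, value in corrected.items():
    print(pair, value)
```

Under this assumption the computed values agree with the published column to two decimals, and the two same-form pairs remain the most highly correlated after correction.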

Speed is relevant and important in tests of typing attainment or reading
facility, or in tests of arithmetic for cashiers. Speed is irrelevant when we
wish to know how large a pupil’s vocabulary is, how much science he
knows, or how accurately he can reason. Speeding can usually be justified
in proficiency tests only if the test is intended to predict success in a task
where speed is helpful.

Many popular testing techniques are strongly affected by response styles.
A response style is a habit or momentary set which causes the subject to
earn a different score from the one he would earn if the same items were
presented in a different form. In true-false tests particularly, some people
have the habit of saying “true” when in doubt, while others are characteristically
suspicious and respond “false” when in doubt. If the tester has included
a large proportion of true statements in his test, the acquiescent student will
earn a high score even if his knowledge is limited. Other response styles
include tendency to gamble, working for speed rather than accuracy, and
use of a particular style in essay tests.

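
The advantage enjoyed by the acquiescent student can be put in numbers with a very simple model. The sketch below is an illustration, not a finding from the studies cited: it assumes each student answers correctly every item he knows, gives the same habitual answer to every item he does not know, and that the unknown items contain the same proportion of true statements as the test as a whole. The figures (70 percent true statements, 40 percent of items known) are hypothetical.

```python
def expected_true_false_score(known, prop_true, says_true_when_in_doubt=True):
    """Expected proportion correct when every unknown item gets the same guess.

    known:     proportion of items the student actually knows (0 to 1)
    prop_true: proportion of the test's statements that are keyed true
    """
    guess_hit_rate = prop_true if says_true_when_in_doubt else 1.0 - prop_true
    return known + (1.0 - known) * guess_hit_rate

# Hypothetical test: 70 percent of statements are true; both students know 40 percent.
acquiescent = expected_true_false_score(0.40, 0.70, says_true_when_in_doubt=True)
suspicious = expected_true_false_score(0.40, 0.70, says_true_when_in_doubt=False)
print(f"acquiescent: {acquiescent:.2f}   suspicious: {suspicious:.2f}")
```

Two students with identical knowledge thus receive quite different scores, and the difference reflects only the response style and the tester's choice of how many true statements to include.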
Aptitude tests are also affected by response styles, though to a lesser degree
than proficiency tests (Cronbach, 1950). In one of Thurstone’s spatial
tests, the student is to mark all the figures in a row of six which are just like
a given figure save for being rotated. Some students consistently mark many
figures in the row, while some mark only one figure even when several are
correct. This caution, or lack of thoroughness, lowers scores. The Seashore
pitch test requires subjects to judge whether the second of two tones is
higher or lower than the first. Some students are strongly biased toward
one of the two answers; in one class of ten students, the most biased student
marked 75 items H and only 25 L. After the class was given a short talk on
the nature of bias, their scores improved. This particular student gained 14
points (on a 50-point scale from pure chance to perfect).

For measuring ability, multiple-choice or best-answer tests are distinctly
preferable to tests having fixed response categories such as true vs. false or
agree vs. disagree. The best-answer test is not only virtually immune to response
biases other than tendency to gamble but is especially well adapted
to testing of comprehension.

11. A mental test uses items like the following:

sweet-sour . . . . . . SAME-OPPOSITE

obscure-lucid . . . . SAME-OPPOSITE

occult-mystical . . . SAME-OPPOSITE

What response styles is such a test affected by when given with a time limit?
Design a test for the same ability which would be less influenced by response
style.

Recognition vs. Recall. A major issue in educational testing is whether
recognition tests and recall tests on the same content measure the same
ability. Multiple-choice and other recognition items are necessarily given
great emphasis in standardized testing because they are easy to score. This
has been a source of concern to teachers who feel that only tests requiring
free responses can measure adequately what they teach. Especially where
the purpose of teaching is to produce ability to recall or invent new solutions,
teachers tend to prefer free-response tests. The English teacher prefers to
judge a student from a sample of his free writing, rather than on tests where
he merely identifies errors. The mathematics teacher feels that his students
should be required to solve problems, rather than merely to select alternatives
in what one writer calls “place-your-bet” questions.

To evaluate this argument requires an experiment to determine whether
the recognition and recall tests rank subjects in the same way. The result
of this experiment depends on the ability measured. In arithmetic, the two
rankings correspond closely; at the other extreme, penmanship performance
has negligible correlation with ability to recognize good writing. In college
mathematics, multiple-choice questions had reliability coefficients and correlations
with grades in later mathematics essentially the same as those for
free-answer questions (College Entrance Examination Board, 1946).

One might think that ability to generalize from data could be tested only
by requiring that the student form his own generalizations. But a test requiring
undergraduates to identify the best and poorest generalizations from
a set of data correlated .85 with ability to draw generalizations directly
from the data. Planning an experiment is a creative function, yet a recognition
test calling for choice among alternative plans correlated .79 with a
free-response test of ability to make plans (R. W. Tyler, 1934, pp. 27-30).

It seems likely that free-response tests can be superior to recognition tests
where one is required to measure very accurately. Among graduate students
who have overlearned the verbally stated principles of scientific method,
probably all would do well on any reasonable objective test of experimental
design. But even among such students there is marked variation in inventiveness
in attacking new problems, and a long, carefully scored free-response
test may be the best measure. Similarly, an objective recognition
test sorted students accurately on French pronunciation (Tharp, 1935). But
in an advanced group it is doubtful that fine discrimination between those
with authentic and those with false accents could be obtained by anything
but a performance test.

The most serious charge against recognition tests is that they have often
been confined to measurement of simple, even trivial knowledge of facts. It
is possible, as many examining bodies in universities have demonstrated, to
devise objective questions which call for deep comprehension and subtle
reasoning. Recognition tests are by no means limited to simple mental processes.
The difficulty is that ingenuity and effort are needed to prepare a
penetrating objective test, whereas a taxing (if not necessarily valid) essay
question can be scribbled off in minutes.

12. A course in psychological testing is intended to prepare students to perform the
skills listed below. For which would a recognition test be acceptable?

a. Selecting a test battery for a college counseling bureau.

b. Administering and scoring the WAIS.

c. Drawing proper conclusions from a validation study.

d. Making proper interpretations of technical terms used in test manuals.

13. What is the relative importance of free recall and recognition of correct responses
in

a. learning to interpret children’s problem behavior in terms of probable
causes?

b. learning to play bridge?

14. Discuss the following comment by a newspaper columnist:

“I view with some misgivings the purely utilitarian course in ‘Communications’
which has been substituted for the traditional freshman composition course at
S____. Students’ needs in this course, we are told, are ascertained by the
administration of ‘batteries of tests.’ I venture to assert that nothing will be
learned from these tests which a skilled teacher would not find out from a single
theme and a half-hour interview, and that these would be better for the student
psychologically, as motivation for the course, than the ‘batteries of tests.’”

Taxonomy of Educational Outcomes. In contrast to the vast effort to map out
aptitudes by factor analysis, there has been almost no systematic research
on proficiency variables. Tyler’s correlational studies barely illustrate the
questions to be asked. How many distinct types of outcome are there in a
particular content area? Is scientific inference a general ability, or specific
to one science? Are the correlations between outcomes the same at all ages?
How do the relations between outcomes depend on the method of instruction?
And so on. Research on such questions remains utterly fragmentary.

Proficiency testing is usually an ad hoc effort to satisfy a practical need:
a test for torpedomen in this Navy school, a test for clerks in this particular
civil service agency, a test of French for freshmen in this college. The test
developer moves from one such assignment to the next, never pausing to ask
fundamental psychological questions. Tests marketed nationally undergo
extended developmental research, but that research is more concerned with
removing ambiguous items and developing accurate norms than with clarifying
the nature of proficiency.

The structure of proficiencies has been less appealing as a research problem
than “the structure of mental abilities.” It is obvious that the correlation
between proficiencies depends upon what one has studied. Factual knowledge
of college physics correlates with skill in mathematics, if only because
most physics students also take mathematics. When he investigates whether
an “aptitude” such as mechanical comprehension is correlated with mathematical
reasoning, the psychologist often has the illusion that he is dealing
directly with the natural organization of the mind. That this is an illusion is
shown on the one hand by the substantial correlation between the TMC and
the biographical inventory, and on the other hand by theoretical research
such as Piaget’s. The aptitudes treated by the factor analyst are no less
dependent on experience than are the proficiencies. How mathematical understanding
of science, after training, is related to understanding of the concrete
aspects of science is fully as challenging and urgent a problem as any
question about predictive abilities.

The only major advance in conceptualizing proficiency variables has come
from a logical rather than an empirical investigation. A group of specialists
in educational testing, most of them university examiners, has developed
a “taxonomy” of educational objectives. This is a grand index of all the
variables which instructors and educational testers have suggested measuring
for the purpose of evaluating instruction. The variables are classified
logically; these groupings provide hypotheses that certain types of behavior
are psychologically similar (for example, that they might be developed by
similar teaching methods).

As outlined in Table 53, the taxonomy has six major sections: Knowledge,
Comprehension, Application, Analysis, Synthesis, and Evaluation.¹ The
abilities are listed in an approximate order of complexity; sections are also
subdivided to separate more and less complex processes. One must comprehend
something before he can apply it, generally speaking, and he must be
able to analyze elements before he can analyze organization. The taxonomy
gives a complete definition of each category and illustrates the category
with several educational objectives and several pages of test items.

The taxonomy has considerable value in improving communication between
testers and instructors. It offers a standard vocabulary for discussing
testing problems and provides a sort of checklist so that evaluators can recognize
whether they have listed all the objectives that ought to be measured.
At present, the taxonomy is limited to “cognitive” performance, i.e., to
knowledge, comprehension, and reasoning.

The illustrative test items for measuring higher mental processes are of
unusual interest. We can select only a few illustrations here, beginning with
a simple factual item falling in category 1.12 (knowledge of specific facts).

¹ Table 53, Figure 67, and the test items in this section are taken with minor
modifications from Bloom (1956). Copyright 1956 by Longmans, Green and Company
and reproduced by permission.

TABLE 53. Synopsis of the Taxonomy of Educational Objectives

1.00 Knowledge. Remembering something previously encountered.
  1.10 Knowledge of specifics. Recall of bits of concrete information.
    1.11 Knowledge of terminology
    1.12 Knowledge of specific facts
  1.20 Knowledge of ways and means of dealing with specifics. Includes methods
       of inquiry, chronological sequences, standards of judgment, patterns of
       organization within a field.
    1.21 Knowledge of conventions: accepted usage, correct style, etc.
    1.22 Knowledge of trends and sequences
    1.23 Knowledge of classifications and categories
    1.24 Knowledge of criteria
    1.25 Knowledge of methodology for investigating particular problems
  1.30 Knowledge of the universals and abstractions in a field. Includes
       organization of ideas by means of theories.
    1.31 Knowledge of principles and generalizations
    1.32 Knowledge of theories and structures (as a connected body of principles)
2.00 Comprehension. Understanding of material being communicated, without
     necessarily relating it to other material.
  2.10 Translation from one set of symbols to another
  2.20 Interpretation. Summarization or explanation of a communication.
  2.30 Extrapolation. Extension of trends beyond the given data.
3.00 Application. The use of abstractions in particular, concrete situations.
4.00 Analysis. Breaking a communication into its parts so that organization of
     ideas is clear.
  4.10 Analysis of elements. E.g., recognizing assumptions.
  4.20 Analysis of relationships
  4.30 Analysis of organizational principles. E.g., recognizing techniques of
       propaganda.
5.00 Synthesis. Putting elements into a whole.
  5.10 Production of a unique communication
  5.20 Production of a plan for operations
  5.30 Derivation of a set of abstract relations
6.00 Evaluation. Judging the value of material for a given purpose.
  6.10 Judgments in terms of internal evidence. E.g., logical consistency.
  6.20 Judgments in terms of external evidence. E.g., consistency with facts
       developed elsewhere.


Number of annual rings at the base of the trunk of an old tree is
    (greater than / less than / the same as)
the number of rings half-way up the trunk.

Knowledge of methodology (1.25) is also at the level of sheer recall, but the
content is more general. For example:

Fossils on rocks constitute valuable clues to the past. Some of these fossils are
identical with animals existing today. How does this affect the investigation of
geological history? (Choose one.)

a. Such fossils make the work much simpler since they can be easily traced.

b. These fossils are rare and therefore do not weaken the overall results very
much.

c. These fossils are extremely valuable since observation of their living
counterparts yields much information as to climates and physical conditions of the
geologic past.

d. The existence of living counterparts of fossils is immaterial since only the
fossil itself is important.

An item is not classified entirely by its content, since the processes the
student uses will depend on his experience. In a biology course which has been
studying about fossils, this is likely to be a recall item, but in a general
science course which has not touched on fossils, this item requires application
of scientific method to a new problem. In the taxonomy, the item would
have to be classified as a measure of application (3.00) for the general
science student.

        (A)                  (B)                  (C)

      H H H H                H H               H   H   H
      | | | |                | |               |   |   |
    H-C-C-C-C-H            H-C-C-OH          H-C---C---C-H
      | | | |                | |               |   |   |
      H H H H                H H               H   |   H
                                                 H-C-H
                                                   |
                                                   H

        (D)                  (E)

      H H H H                H   O
      | | | |                |  //
    H-C-C=C-C-H            H-C-C
      |     |                |  \
      H     H                H   OH

1. The compound which can neutralize bases and form salts.

2. The hydrocarbon which has the least tendency to “knock” among
those listed above.

3. The compound which decolorizes bromine and potassium permanganate.

FIG. 67. Item testing comprehension of organic chemical formulas rather than memory alone.

Comprehension items go beyond recall and ask the student to restate
material. The item in Figure 67 requires matching organic chemical compounds
with their properties. The compounds are representative of familiar
types (e.g., B is an alcohol), but the student is not expected to know the
specific chemical formulas given. This item is classified as translation
(category 2.10) since the formula must be recognized as equivalent to the verbal
definition of an acid, etc.

The following application item (category 3.00) calls for free response.

John prepared an aquarium as follows: He carefully cleaned a ten-gallon glass
tank with salt solution and put in a few inches of fine washed sand. He rooted
several stalks of weak elodea taken from a pool and then filled the aquarium with
tap water. After waiting a week he stocked the aquarium with ten one-inch
goldfish and three snails. The aquarium was then left in a corner of the room. After a
month the water had not become foul and the plants and animals were in good
condition. Without moving the aquarium he sealed a glass top on it.

What prediction, if any, can be made concerning the condition of the aquarium
after a period of several months? If you believe a definite prediction can be made,
make it and then give your reasons. If you are unable to make a prediction for any
reason, indicate why you are unable to make a prediction (give your reasons).

The items in categories 4.00 to 6.00 tend to run considerably longer. The
student may read an argumentative passage and be asked to tell the function
of a certain sentence (4.20, analysis of relationships), or he may listen
to a recorded musical selection and answer questions on the development
of the themes (4.30, analysis of organizational principles). “Synthesis”
ordinarily requires free response; e.g., one item asks the student to develop a
chemical process to satisfy given specifications. As a final example, we cite
part of an “evaluation” item (6.00). The student is to suppose he reads
some surprising statements about language by an Otto Jespersen, and is to
tell whether each of the following facts would lead him to trust Jespersen’s
statement, or to distrust it, or would have no significance.

a. Mr. Jespersen was Professor of English at Copenhagen University.

b. The statement in question was taken from the very first article that
Mr. Jespersen published.

c. Mr. Jespersen’s books are frequently referred to in other works that you
consult.

The taxonomy is an impressive analysis of constructs which perhaps
describes the way intellectual processes are organized. Many decades ago,
however, the collapse of faculty psychology taught psychologists to be
suspicious of purely logical categorizations. Though the faculty psychologists
had defined innumerable independent powers of the mind such as memory,
judgment, and reasoning, they could find no way to measure these powers
separately or to distinguish them from general adaptive ability. The
categories of the taxonomy refer, not to hidden mental powers, but to observable
abilities to solve specific types of problems. These abilities are obviously
measurable. Nevertheless, one must ask whether the categories describe
separate types of behavior. If the tests given different names correlate highly,
the distinctions are of little value, and if tests within the same category
correlate little, the grouping is artificial.

The one recent study of the organization of intellectual skills appeared
prior to the taxonomy. Furst (1950) administered 27 tests covering several
subject fields to two groups of students at the start of the eleventh grade,
and again late in the twelfth grade. Within each subject there were tests of
factual knowledge, judgment of relations, application of principles, etc. One
group in a private experimental school was taught by a method emphasizing
integration of courses and development of higher mental processes. The
other group, from public high schools, was taught by more formal methods,
the content areas being sharply separated from each other. Furst determined
whether the tests of different intellectual processes had different correlations
at the start and end of the experimental period, whether the training
program affected the pattern of correlations, and how highly tests measuring
similar intellectual processes were correlated. The total study involved 1600
correlations, and only a superficial résumé of results can be given here.
Despite the differences in the two educational programs, the correlational
patterns for the two groups were nearly alike. The most important finding
is that tests dealing with the same subject area had higher intercorrelations
than tests dealing with the same mental process. In Table 54, based on


TABLE 54. Correlation Among Proficiency Tests Categorized
by Subject Matter and by Mental Processes

                               Average Correlation     Average Correlation
                               Within Group of Tests   with Tests Not in Group

Subject-matter groupings
  English                              .48                     .32
  Humanities                           .28                     .23
  Social studies                       .45                     .35
  Physical sciences                    .45                     .31
  Mathematics                          .55                     .34
  All categories                       .44                     .31

Mental-process groupings
  Critical thinking                    .38                     .32
  Recall of information                .25                     .31
  Reading                              .52                     .39
  Language expression                  .44                     .29
  Application of principles            .24                     .29
  Interpretation of data               .33                     .28
  All categories                       .36                     .34

Source: Furst, 1950.


the public school, the average correlation among a group of tests is compared
with the average correlation of those tests with all other tests. The former
value must be higher before one can argue confidently that tests within the
group measure some common ability. Subject-matter groupings clearly meet
this requirement; mathematics tests or science tests do have more in common
with other tests of the same subject than they do with tests outside the
subject. This is to be expected. Knowledge of scientific facts, application of
scientific principles, and interpretation of scientific data are developed in
the same class, and the same pupils who do well in that class tend to do
well on all tests in that subject. The evidence on mental-process groupings
is essentially negative. There is no general “ability to apply principles”; the
tests of application in various subjects actually have less in common with
each other than they do with tests of other processes. Likewise, Furst found
little evidence for a general ability to think critically or to interpret data.
This study leaves considerable need for further information. The correlation
between tests in the same field is not high; evidently the various tests
within the science field, for example, do measure somewhat distinct abilities.
But we have little knowledge as to why some students develop one
scientific ability more than another, and little understanding as to how
interpretation of data in science differs psychologically from interpretation
of social data.
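
The comparison behind Table 54 can be made concrete with a short sketch. The function name, the four-test correlation matrix, and the group labels below are hypothetical illustrations, not Furst's data or procedure; the sketch only shows how an average within-group correlation and an average correlation with outside tests might be computed from a matrix of intercorrelations.

```python
def within_and_outside(corr, labels, group):
    """Return (mean r among tests in `group`,
               mean r between those tests and all tests outside it).

    corr   -- square, symmetric matrix of test intercorrelations
    labels -- group label for each test, in matrix order
    Assumes the group contains at least two tests and is not the whole set.
    """
    members = [i for i, g in enumerate(labels) if g == group]
    others = [i for i, g in enumerate(labels) if g != group]
    # Each pair of distinct tests inside the group, counted once.
    within = [corr[i][j] for a, i in enumerate(members) for j in members[a + 1:]]
    # Every pairing of a group test with a test outside the group.
    outside = [corr[i][j] for i in members for j in others]
    return sum(within) / len(within), sum(outside) / len(outside)


# Hypothetical example: two mathematics tests and two English tests.
labels = ["math", "math", "english", "english"]
corr = [
    [1.00, 0.55, 0.30, 0.34],
    [0.55, 1.00, 0.36, 0.32],
    [0.30, 0.36, 1.00, 0.48],
    [0.34, 0.32, 0.48, 1.00],
]

within, outside = within_and_outside(corr, labels, "math")
print(f"within-group r = {within:.2f}, outside-group r = {outside:.2f}")
```

A grouping supports the claim of a common ability only when the first value clearly exceeds the second; the same function could be rerun with mental-process labels in place of subject labels to repeat the contrast the table draws.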

15. a. List several outcomes which might be considered in evaluating proficiency
in algebra.

b. Classify these outcomes according to the taxonomy.

c. What empirical questions might be asked about the relation between these
several proficiencies? What value might this information have in designing
subsequent tests, in altering instruction, and in guiding students?

16. In question 15, substitute clinical psychology for algebra, and answer the same
subquestions.

17. Would a logical taxonomy of psychomotor abilities have led to the same
results as factor analysis?

18. The following quotations from want ads specify proficiencies that an employer
might want to test by interview, written test, or other methods. Locate the
proficiencies in the taxonomy as well as you can.

a. “Wanted: Young man for advertising agency with ‘a flair for writing.’”

b. “Wanted: Senior marketing research analyst, thoroughly familiar with
customer testing procedures.”

19. What might a person studying psychological testing learn that would fall in
each of the following categories of the taxonomy: 1.11, 1.21, 1.23, 1.24, 1.31,
2.30, 4.10, 5.20, 6.20?

PUBLISHED TESTS OF EDUCATIONAL OUTCOMES 

Among the myriad tests which have been published for measuring pupil
accomplishment, some are concerned with single subjects like history or
science. Batteries of achievement tests have subtests measuring several
important areas of school attainment. The subtests are standardized together,
so that one can compare the pupil’s relative standing in one subject
with his standing in other subjects. The areas most commonly measured in
elementary-school batteries include reading, spelling, language usage,
arithmetic, and social studies. On the whole, because of the way schools are
organized, single-subject tests have been more widely used at the
high-school and college level than comprehensive batteries, but tests of general
educational development are now widely used for selection of students
and for guidance.


Tests of General Educational Development 

If one’s purpose is to determine how much science a student contemplating
a premedical course knows, his general scientific competence is of more
interest than his mastery of a particular subject. A few recent test batteries
for high school and college attempt to measure general educational development
without regard to narrow subject-matter divisions. One important use
of such tests is to evaluate a man’s readiness for college (Dressel and
Schmid, 1951). GED tests were first designed to assist men returning from
military service to reenter the educational system at the appropriate level,
regardless of the amount of formal credit they had received. Lindquist, a
designer of the original tests, indicates their philosophy (1944, p. 366):

The real ends of instruction are the lasting concepts, attitudes, skills,
abilities, and habits of thought, and the improved judgment or sense
of values acquired; the detailed materials of instruction—the specific
factual content—are to a large extent only a means toward these ends.
Since the detailed materials out of which a self-educated serviceman
might have developed his . . . thinking might differ considerably . . .
from those used in formal classroom instruction, we felt that . . . we
must try to measure as directly as possible the ultimate outcomes of a
general education, and to minimize as much as possible the formal
pedagogical procedure that may be used to attain them in classroom
instruction.

GED batteries measure mathematical ability and English expression by
rather conventional items, but in science, social studies, and literature,
instead of testing what scientific facts or works of literature the student is
familiar with, the battery uses “tests of interpretation.” He is asked to read a
passage resembling those in college science texts, and then is tested for
comprehension. Similarly, he is required to interpret social science materials and
passages from literature. The test draws on knowledge but requires few specific
facts. It should be noted that these tests are measures of general education,
i.e., of proficiencies that may apply to a wide range of future experiences.
Readiness for a specific course (e.g., college zoology) depends both on general
intellectual development and also on specific attainments from prerequisite
courses. The latter are measured by proficiency tests in particular
subjects.

For teaching purposes, measures of overall proficiency are not sufficient.
The teacher needs to know specific strengths and weaknesses of each pupil,
and diagnostic tests provide this information. Diagnostic tests focus on the
process by which the student responds, rather than the product. Diagnostic
procedures in reading will be described below. Since they stress analysis of
the individual’s errors rather than comparison between students, diagnostic
procedures are rarely standardized.

Early proficiency tests have measured knowledge and routine skills,
neglecting higher intellectual processes. Recent tests have paid more attention
to complex intellectual skills such as interpretation of experiments. Many
of these are “transfer” tests requiring application of skills and ideas to
situations not studied. Tests of ability to apply principles ask the pupil to solve
unfamiliar problems using the principles he has learned. If a pupil can solve
a problem he has not studied and defend his solution with a sound scientific
principle, it is certain that he understands the principle. The TMC is in
effect a test of application of principles of mechanics. Another example is the
“aquarium” item above.

20. If admission to college depends in part on a test which stresses knowledge of
historical facts, what instruction given high-school seniors would improve their
chances of passing? What instruction would help them most if admission is
based on a test of interpretation of social studies materials?

21. Discuss the argument “The GED tests of interpretation are measures of
intelligence and reading ability rather than of educational development in subject
fields.”

Important Educational Achievement Batteries 

The following list of educational tests is by no means exhaustive, but it
covers many prominent types including those a psychologist or counselor is
most likely to encounter. Most of the batteries are divided into parts which
can be given separately where measurement in only one area is required.

• California Achievement Tests, Ernest W. Tiegs and Willis W. Clark,
California Test Bureau, 1933, 1950, 1957. Grades 1-2, 3-4, 4-6, 7-9, 9-14.
Earlier edition known as Progressive Achievement Tests. A comprehensive
three-hour battery yielding separate scores in vocabulary, reading
comprehension, arithmetic reasoning, arithmetic fundamentals, English mechanics,
spelling; these scores have reliabilities .79-.95 (Grade 6). Large differences
between a pupil’s subtest scores are significant. The proposed “diagnostic
analysis” based on small subgroups of items is not a dependable basis for
studying learning difficulties. Norms are derived from a representative
national sample also used for norming the CTMM, thereby permitting
accurate comparison of achievement for each pupil with that of pupils having
similar general ability (see below).

• California Tests in Social and Related Sciences, Georgia Sachs Adams
and others, California Test Bureau, 1946, 1953. Grades 4-8, 9-12. A test of
three parts, yielding six scores with reliability .85-.95 (6th grade). Four
sections cover history, geography, and other social studies; two sections cover
science content. Items at the elementary level test knowledge of the important
facts and central concepts of the typical program in general social
studies and science. The advanced level deals specifically with American
history (four sections), and with a mixture of factual and reasoning questions
selected from various science courses.

• Essential High School Content Battery, David P. Harry and Walter N.
Durost, World Book, 1951. Grades 10-13. A three-and-one-half-hour battery
measuring knowledge in mathematics, science, social studies, and English.
Each section covers specific course content rather than general comprehension.
For example, the mathematics test includes algebraic factoring,
recognizing graphs of conics, and recalling theorems about perpendicular chords.
Other problems cover everyday arithmetic reasoning, use of tables, etc. The
science section surveys factual and vocabulary knowledge and also measures
ability to reason from principles to conclusions. The single score for
each area is less analytic than the finer subdivision given in ITED or STEP.

• Evaluation and Adjustment Series, Walter N. Durost (ed.), World
Book, 1950. A series of tests for high-school use, each test with a different
author and for a different subject. (Examples: Anderson Chemistry Test,
Davis Test of Functional Competence in Mathematics, Engle Psychology
Test.) The several tests vary in quality, but each of the better tests represents
a comprehensive survey of outcomes regarded as important by specialists
in the field. The chemistry test covers principles such as valence
and photosynthesis, practical applications, interpretation of experiments,
chemical formulas, and quantitative problems. Norms are for students who
have had one year of chemistry. In general, tests in this series are well
designed for end-of-year evaluation of attainment in basic courses.

• Iowa Tests of Basic Skills, E. F. Lindquist and A. N. Hieronymus,
Houghton Mifflin, 1940, 1956. Grades 3-9. A battery requiring about five
hours, yielding scores on vocabulary, reading, arithmetic, language, and
work-study skills, each having a reliability .90 or over (Grade 6). Norms are
based on carefully selected national samples for each grade early in year, at
midyear, and at end of year. Each test contains sections of increasing
difficulty; pupils in any grade take only those sections appropriate in difficulty
for them, but questions for adjacent grades overlap. All sections require use
of skills in meaningful contexts. The section on work-study skills measures
ability to read maps, graphs, and charts, and ability to use reference
material, indices, etc.

• Iowa Tests of Educational Development (ITED), E. F. Lindquist and
others, Science Research Associates, 1942, 1952. Grades 9-13. An eight-hour
battery of nine tests designed to measure general educational development
in skills and thinking abilities, regardless of particular courses or content
studied. Scores include understanding of basic social concepts, interpretation
of reading materials in social studies, use of sources of information,
quantitative thinking, correctness and appropriateness of expression, etc.
The tests are carefully normed; reliabilities range from .81 to .94. The battery
predicts college grades with validity near .60, this high validity being
attributable in part to the length of the battery. “Secure” versions of ITED, of
various lengths, are used in scholarship competitions, and in the American
College Testing Program, which obtains information on high school seniors
for use by college admissions officers.

• Metropolitan Achievement Tests, Gertrude H. Hildreth and others,
World Book, 1931, 1946, 1959. Grades 1, 2, 3-4, 5-6, 7-9. The elementary
level provides nine scores in three hours of testing, measuring vocabulary,
reading, arithmetic, and language usage. At higher levels, tests of study
skills and information in science and social studies are added.

• SRA Achievement Series, Louis P. Thorpe, D. Welty Lefever, Robert A.
Naslund, Science Research Associates, 1954, 1957. Grades 2-4, 4-6, 6-9. A
seven-hour battery measuring work-study skills, reading, language usage,
and arithmetic. (Other tests in preparation.) An attractive test using story
materials to measure fundamental skills in meaningful contexts. Designed to
give accurate end-of-year measures for average and able students. Retarded
pupils earn such low scores on the test for their grades that for accurate
measurement they should be retested on the next lower level of the series.

• Sequential Tests of Educational Progress (STEP), various authors,
Cooperative Test Service, 1958. Grades 4-6, 7-9, 10-12, college. A battery of
seven tests, each requiring ninety minutes to obtain a score with reliability
.83 to .91. Norms can be compared to scores on the School and College Ability
Tests. In reading, quantitative ability, science, and social studies, the
student is required to comprehend and draw conclusions about complex
selections, realistic problems, unfamiliar experiments, etc.; the tests thus require
a deeper mastery than many skill or content tests do. One novel subtest is a
measure of listening comprehension for passages read by the teacher.
Another is an ingenious objective test of ability to judge and improve writing
style. This battery is particularly likely to encourage teaching for
understanding.

• Stanford Achievement Test, Truman L. Kelley and others, World Book,
1923, 1943, 1953. Grades 1-3, 3-4, 5-6, 7-9. The several revisions of the
Stanford test have been more widely used than any other achievement battery.
Revisions have greatly improved the norms and score conversions without
radically altering the test content. There are five scores at the primary level
(80 minutes) and eight at the advanced level (215 minutes); two-thirds of
the reliability coefficients are .88 or better. The skill tests (paragraph
meaning, arithmetic computation, language, etc.) contain carefully written items.
The problems are very similar to those used for practice in traditional
lessons. The social studies and science sections ask miscellaneous, unrelated
factual questions; modern educational theory regards understanding of central
concepts in these fields as more important than such recall of isolated
facts.

22. Some achievement test manuals report no information on score
intercorrelations. What value is there in knowing not only the reliabilities of the California
Mathematics Reasoning and Mathematics Fundamentals scores (.91 and .93,
respectively) but also the intercorrelation (.77)?

23. The GED tests require up to two hours to obtain a single score, whereas the
California battery obtains about twenty scores in two hours. What differences
in viewpoint underlie these different practices?

24. Some test manuals encourage teachers to study answer sheets to determine
what items each individual student missed. The Iowa Tests of Educational
Development, however, discourage this practice, saying that such analysis does
not provide a dependable basis for individual diagnosis. Why not?

25. Some achievement tests have end-of-school-year norms, whereas others report
norms for the beginning of the school year. For what purpose is each type
better suited?

Norms for Educational Tests 

Grade Norms. When standardized tests were first introduced, the manuals
began to translate raw scores into “grade equivalents.” These equivalents are
somewhat analogous to the mental age. A “grade equivalent” of 6.0 is
assigned to the score the average beginning sixth-grader makes. Just as mental
ages have proved to be an unsatisfactory and misleading system for general
mental tests, so grade equivalents have proved unsatisfactory for reporting
educational development.

Grade norms are based on samples of pupils throughout the nation. Some
sections of the country are far superior to others, because of differences in
pupil ability, differences in the quality of teachers, and differences in
expenditures for education. No teacher or superintendent from a superior
school can take pride if his group merely reaches the national norms; no
one from a handicapped school district should be condemned if his group
cannot attain the national average. The only fair basis for comparing schools
is to judge each school against schools with similar organization, similar
curricula, and similar promotion policies. Rarely are published norms based on
such meaningful segments as “New England public elementary schools, in
cities with population 2000 to 10,000” or “Southern rural elementary schools.”

Norms are not “standards.” It is a common mistake to assume that all pupils
in the ninth grade should reach the ninth-grade norm. This is of course a
fallacy; 50 percent of the pupils in the standardizing sample fall below the
norm. Furthermore, the test shows only what schools are doing at present.
It is highly unlikely that the schools are doing so well that the national
average represents what pupils could attain with the best teaching methods. The
teacher whose class reaches the average has no cause for complacency.
There is much room for the development of better educational methods.

Grade norms are based on grossly unequal and artificial units of measurement.
It looks as if a pupil is greatly superior if he reaches the “ninth-grade
level” in science when he is only in Grade 6. But in many standard
tests such a difference in score represents very little difference in ability,
because the average rises only slightly from Grade 6 to Grade 9. In Figure 68
we see curves for two tests of the Stanford Achievement series in which
grade norms are compared to statistically derived “K scores.” We cannot here
consider the assumptions made in deriving the K scores, but the general
implication of Figure 68 would hold for almost any scoring method. In
Language, grade increments above Grade 7 imply only very small improvements
in performance. In Social Studies, on the other hand, a three-grade
gain represents a large increase in knowledge.

FIG. 68. “True” increases in ability corresponding to equal changes in grade scores for two
subtests of the Stanford Achievement Test.

“Ninth-grade levels” in different subjects are not equally hard for
sixth-graders to reach. The pupil who is “two years beyond his grade” in a subject
may sometimes be markedly superior; at other times this standing is equaled
by a large proportion of his class.

One further serious limitation of grade equivalents arises when conversions
for very high and low scores are derived statistically rather than by direct
observation. In a test battery intended for Grades 4 to 6 the author measures
pupils in those grades and then may determine by extrapolation what score
“ought to correspond” to the Grade 2 or Grade 8 average. This is sometimes
done just to save effort in standardization, and sometimes because it is
impossible for second-graders to take a sixth-grade test.
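
The difference between an observed and an extrapolated grade equivalent can be illustrated with a minimal sketch. It assumes, purely for illustration, that a conversion table is built by straight-line interpolation between the mean raw scores of the grades actually tested; the norm values and the function name are hypothetical, and real test manuals may use other smoothing methods.

```python
def grade_equivalent(raw, norms):
    """Convert a raw score to a grade equivalent by linear interpolation.

    norms -- list of (grade, mean_raw_score) pairs for the grades actually
             tested, sorted by grade, with strictly increasing means.
    Returns (grade_equivalent, extrapolated), where `extrapolated` is True
    when the raw score lies outside the range of observed grade means.
    """
    if raw <= norms[0][1]:
        lo, hi = norms[0], norms[1]        # extend the lowest segment downward
    elif raw >= norms[-1][1]:
        lo, hi = norms[-2], norms[-1]      # extend the highest segment upward
    else:
        lo, hi = next((a, b) for a, b in zip(norms, norms[1:])
                      if a[1] <= raw <= b[1])
    (g0, s0), (g1, s1) = lo, hi
    value = g0 + (raw - s0) * (g1 - g0) / (s1 - s0)
    extrapolated = raw < norms[0][1] or raw > norms[-1][1]
    return value, extrapolated


# Hypothetical norms for a battery standardized only on Grades 4 to 6.
norms = [(4.0, 30), (5.0, 38), (6.0, 44)]

for raw in (41, 56, 14):
    ge, flag = grade_equivalent(raw, norms)
    note = "extrapolated, never observed" if flag else "within tested grades"
    print(f"raw {raw}: grade equivalent {ge:.1f} ({note})")
```

In this sketch a raw score of 56 is labeled "Grade 8.0" although no eighth-grader was ever tested; the figure rests entirely on the assumption that growth continues at the Grade 5 to Grade 6 rate.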
Grade norms imply that two pupils with a grade equivalent of 7.0 are similar,
even if one is in Grade 4 and one in Grade 9. This is just as unsound as
assuming that an MA of 12 means the same thing for a 9-year-old and for a
14-year-old. The advanced pupil and the retarded pupil with the same score
make different errors, and they are by no means ready for the same type of
instruction. Grade norms can lead teachers and parents only to unsound
conclusions and should be replaced by percentile scores based on single-grade
groups in a defined type of school, or by some similar system. We may expect
the grade norm to remain in use long after all test specialists agree on
its inappropriateness, just as it took a long time to displace the ratio IQ.
Teachers and school administrators are used to it and look for it, the
publisher of a new test feels that he must satisfy this demand, and the vicious
circle rolls on.

FIG. 69. Expectancy chart for California Achievement Test in Grade 7. (Redesigned
from the chart presented in the CAT manual, 1957.) [Chart not reproduced;
axis label: Anticipated Reading Comprehension Score.]

Expectancy Norms. The employer who uses a proficiency test to screen out
poor performers is concerned only with low scores. The educator, however,
wants to know if the student is making as much progress as he should. He
therefore wants to evaluate performance relative to ability, and gain over
performance at the beginning of training. The proper procedure compares
the pupil to the normal expectancy for his ability. The technique is
illustrated by the expectancy charts developed for the California Achievement
Tests. The expectancy chart shows what proficiency score can be expected
for pupils with each score on a mental test. Figure 69 shows a simplified
version of the chart for reading comprehension in Grade 7. The tester enters
at the left with the pupil’s MA from CTMM. The line indicates the normal
achievement for pupils with his general ability. For example, a pupil with
MA 180 (15 years) is expected to earn a reading score of about 54. Any lower
score indicates that he is performing below his ability.
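
Reading an expectancy chart amounts to looking up the expected score for a
given mental age and comparing it with the score actually obtained. The
following is a minimal sketch of that lookup, not the procedure of the CAT
manual. Only the pair (MA 180 months, score 54) comes from the text; the other
chart points, the straight-line interpolation between them, and the function
names are assumptions made for illustration.

```python
# Hypothetical chart points: (mental age in months, expected reading score).
# Only (180, 54) is taken from the text; the other two pairs are invented.
CHART = [(120, 30.0), (150, 42.0), (180, 54.0)]


def expected_score(ma_months):
    """Expected score for a mental age, by straight-line interpolation."""
    points = sorted(CHART)
    if not points[0][0] <= ma_months <= points[-1][0]:
        raise ValueError("mental age outside the charted range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= ma_months <= x1:
            return y0 + (y1 - y0) * (ma_months - x0) / (x1 - x0)


def discrepancy(ma_months, obtained_score):
    """Obtained minus expected score; negative means below expectancy."""
    return obtained_score - expected_score(ma_months)


# A pupil with MA 180 who earns a score of 50 falls 4 points below expectancy.
print(discrepancy(180, 50))
```

The sign of the discrepancy, rather than the raw score, is what tells the
teacher whether a pupil is performing below his ability.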

One great advantage of the expectancy chart is that it enables the teacher
to evaluate the attainment of his group even if it is not typical in mental
ability. A teacher who finds that end-of-year performance is below average
usually dismisses the finding if he knows that the group was weak to
start with. The chart can show whether this class is performing as well as
did comparable weak pupils in the norm group.

26. What reading score is expected, according to Figure 69, for a seventh-grader
whose CTMM MA is 14 years?

27. Might there be an advantage in preparing separate expectancy charts for 
pupils from lower-class homes? 

28. Tests of the Evaluation and Adjustment Series are provided with expectancy
charts in which expectancies are shown as a function of IQ. Is this more or less
satisfactory than the use of mental age as in Figure 69?

Conversion Scales for Recording Progress. There are obvious advantages for
the school program in using the same tests from year to year, though it is
necessary to use more difficult tests as the pupil advances through school. It
is also advantageous to standardize tests for the several subject areas on the
same or comparable groups. Publishers of educational tests have invented
single conversion scales which can be used for all the tests they publish.

The usual method for articulating consecutive tests at different levels is
the equipercentile technique (see p. 93). Suppose a reading test has two
levels, one for Grades 7 to 9 and one for Grades 9 to 12. Both tests may be
given to a large sample of ninth-graders, and percentile scores may be
determined. The results will look like this:


                     Lower-Level    Upper-Level
                     Raw Score      Raw Score

80th percentile          62             49
50th percentile          48             33
20th percentile          32             18


Then a raw score of 49 on the upper-level test is regarded as equivalent to a
raw score of 62 at the lower level. This permits measurement of growth for a
pupil who is given the lower form in Grades 8 and 9 and the upper form in
Grade 10.
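
The conversion can be sketched in a few lines. The three anchor pairs below
are the ones in the table; the straight-line interpolation between anchors and
the function name are assumptions made for illustration, since a published
conversion scale would rest on the full percentile distributions rather than
on three points.

```python
# (upper-level raw, lower-level raw) at the 20th, 50th, and 80th percentiles,
# taken from the table in the text.
ANCHORS = [(18, 32), (33, 48), (49, 62)]


def lower_level_equivalent(upper_raw):
    """Convert an upper-level raw score to its lower-level equivalent.

    Assumption: scores between anchors are interpolated linearly. Scores
    outside the anchored range are refused rather than extrapolated.
    """
    for (u0, l0), (u1, l1) in zip(ANCHORS, ANCHORS[1:]):
        if u0 <= upper_raw <= u1:
            return l0 + (l1 - l0) * (upper_raw - u0) / (u1 - u0)
    raise ValueError("score outside the range covered by the anchors")


print(lower_level_equivalent(49))  # the 80th-percentile pair: 62.0
print(lower_level_equivalent(41))  # halfway between anchors: 55.0
```

Once both forms are expressed on the lower-level scale, scores from successive
years can be compared directly. Refusing to extrapolate is a deliberate choice
here, in keeping with the earlier caution about conversions derived
statistically for very high and low scores.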


29. On the test discussed above, interpret the growth shown by a pupil whose
eighth-grade score (lower level) is 32, ninth-grade (lower level) is 48, and
tenth-grade (upper level) is 35.


Reading Tests 

Reading deserves special attention in this book for two reasons. First,
reading tests have been developed in greater number and variety than any
other type of achievement test and demonstrate numerous problems in test
construction. Second, they are used more widely in guidance and clinical
examinations than other achievement tests.

Definition of Abilities. At a glance, reading seems to be a clearly defined
skill which could readily be measured, but tests having the same name
measure quite different behaviors. Authors disagree on what reading tests
should include and on the most useful definition of rate, comprehension,
word knowledge, etc., for testing purposes. One author examined 24 reading
tests and found that between them they measured 48 differently labeled
skills (Traxler, 1941). This does not mean that reading involves 48 specific
abilities, however. One test claimed to measure several “entirely different”
reading skills, but correlations showed that these scores actually measured
the same function over and over under different names. In factor analysis of
25 tests of reading and study skills, the following common factors were
found: tendency to read carefully (an attitude or habit), inductive
reasoning, rate of reading, verbal ability, vocabulary, rate for disconnected
facts, and chart reading (W. E. Hall and F. P. Robinson, 1945). In view of such
variety of test content, the person who needs a reading test must be careful
to define what reading ability he wishes to measure.

Survey Tests. Survey tests are ordinarily intended to assess general level of
reading development. They are used to screen pupils for remedial teaching,
to predict success in courses, and to check whether poor reading explains a
poor score on a group mental test.

Reading development includes both speed of reading and comprehension,
and a useful test must consider both these elements. Most testers have tried
to measure the two aspects of performance independently, but they have
been largely unsuccessful. This problem occurs in most testing, but rarely is
it so obvious as in reading: when an act has several integral aspects, one
cannot divide the act into fragments for testing purposes.

In theory, the way to separate speed and comprehension is to hold one
constant while the other is measured. Speed can be minimized by giving a
test without a time limit; the subject’s understanding of what he reads then
should be a measure of comprehension alone. Rate is much harder to isolate
because every person has many reading rates, which he changes with his
purpose and with the material read (Blommers and Lindquist, 1944). The
usual device for controlling comprehension is to require the subject to
answer questions about what he has read, or to cross out absurdities as he
reads.

The most prominent reading tests for the lower grades are those
contained in standard batteries (also capable of being administered separately).
For college guidance and screening to detect remedial needs, there are
several carefully developed tests including the Survey Section of the
Diagnostic Reading Tests (see below), the Davis Reading Test (Psychological
Corporation), the Cooperative Reading Test (Educational Testing Service),
and the Kelley-Greene Reading Comprehension Test (World Book).

The following limitations should be borne in mind in evaluating reading
survey tests:

• The scores on rate and comprehension are often interdependent, so that
the subject can raise one at the expense of the other. When only a single
“rate of comprehension” score is obtained, thoroughness may lower the
subject’s rate score.

• Time-limit tests supposed to measure comprehension often are strongly
influenced by rate of reading. Such tests have little diagnostic value,
although they may be good predictors of school success.

• The reading test covers only a selected range of content, yet reading
ability varies somewhat with different materials. Some people can read
history well but not science; some do well on stories, poorly on textbooks.
Different content is appropriate for different testing purposes.

• Many tests measure only a limited type of comprehension. The skilled
reader must be able not only to follow sentences but also to take the main
idea from a long passage, put together ideas from separate sentences, follow
a logical argument, and so on. Some reading tests measure only the simplest
comprehension, whereas others demand deep and thorough interpretation.

Diagnostic Methods. A diagnostic proficiency test at its best is an impressive
tool. With or without such tests all teachers and school psychologists must at
times determine why students are having difficulty. An ideal diagnostic
reading test calls attention to every aspect of the reading process wherein
the pupil might have stumbled. Checking off one at a time the many sorts
of possible error, the tester is left with a picture of the specific weaknesses
that must be remedied before the pupil can make normal progress.

Such a diagnostic procedure must be based on extensive research to
determine the common types of errors. Once the errors are listed, it is
necessary to devise test procedures to reveal which errors the pupil makes.
Systematic diagnostic methods have been worked out for arithmetic and a few
other school subjects, but they have reached their highest development in
reading. Reading specialists have available a great variety of diagnostic
techniques, and a few of these have been organized into batteries sufficiently
simple for nonspecialists to use. Among the widely known methods is the
Durrell Analysis of Reading Difficulty (Durrell, 1940).

Durrell based his tests on study of the reading errors made by 4000 school
children. The tests provide an opportunity to observe the child at work in
oral and silent reading, and in special tests. The first tests deal with oral
reading. The tester records the time required to read the standardized
paragraphs, and notes errors as they occur. Silent reading is then checked on a
set of paragraphs of difficulty equal to the oral series. Questions are used to
check recall, and the teacher observes such reading habits as lip movement.
A flash-exposure device is used to show words briefly; this detects
perceptual habits and errors. Finally, there is a phonetic inventory for
children who have difficulty in word perception. The analysis is not a
mechanical device; it calls for keen observation by the tester. In the oral
tests, the tester must record phrase reading, hesitation on words,
mispronunciation, omission of words or syllables, neglect of punctuation, and
enunciation. The virtues of the test are that it presents materials of
standardized difficulty and that the checklist of errors calls the tester’s
attention to all the significant facts.

The type of information that comes from a careful diagnosis is illustrated
by Durrell’s report on Anthony, age 9-8, in the fourth grade. His Binet MA
was 9-4, but his general reading achievement was at the low second-grade
level.

On the Durrell Analysis of Reading Difficulty, Anthony made a low
second-grade score on oral-reading tests, but seemed quite unable to keep his
attention on silent reading. He did poorly on quick perception of words, and
had no method of word analysis. He read a word at a time in a strained voice
and a monotone. He was markedly insecure in his reading and repeated words
continually. He was unaware of the errors in his reading, indicating a lack of
concern about meaning. When his errors were corrected in his oral reading, his
comprehension was excellent.

The silent reading was marked by a high rate at the expense of mastery. He
skipped all the hard words; as a result his recall was scanty and inaccurate,
although he did the best he could with it. Strictly speaking, he did not read
silently at all, since his reading was accompanied by constant whispering of
the words, vague sounds being given for the difficult words. His eye movements
in silent reading were irregular and unrhythmic, with seven to ten per line
and many regressive movements.

Much simpler diagnostic tests are designed for group administration.
These contain subtests presumed to measure various types of reading ability.
Performance is represented as a profile showing the relative strengths and
weaknesses of the pupil. The Diagnostic Reading Tests contain a survey
section and a diagnostic battery to be applied to pupils who do poorly on the
survey. The diagnostic battery designed for Grades 7 and above offers the
following scores:

Vocabulary in special areas
    English grammar and literature
    Mathematics
    Science
    Social studies
Comprehension
    Silent reading of textbook material
    Comprehension of similar material read to the student
Rate of reading
    General
    Social studies
    Science
Word attack
    Oral. An individual test for observing speed and errors in attacking new material
    Silent. A group test of skills such as syllabication




TESTS OF SKILLS IN PERFORMANCE 

Development of skilled performance in a repetitive task is the goal of
instruction in typewriting, comptometer operation, shopwork, blueprint
reading, dressmaking, and industrial training. Measurement of maximum ability
to perform is based on the principle of the worksample. One rates a sample
of the work produced, or observes and judges the performance itself. Many
of the methods described can also be applied to the study of typical behavior
on the job. Methods discussed in Chapters 17 and 18 are especially designed
to assess typical performance.


Product Rating 

For product rating, we must compare specimens of the best work of each
person. To compare people, it is desirable to have them work on similar
material. One standard test in stenography accomplishes this with a recorded
dictation which the subject must take down and transcribe. This method
holds constant not only the difficulty of material but also the speed of
dictation and clarity of speech. McPherson (1945) standardized a test in simple
woodworking by requiring each boy to construct a wood block like a model.
The block was designed to demand use of saw, drill, and chisel. Scoring was
done objectively by imposing a plastic pattern on the block to check
dimensions.

Objectivity in scoring is aided by a checklist or rating scale. This forces
judges to notice the same features of each sample and to use a comparable
numerical scale. Two product rating forms are illustrated in Figures 5 and 70.

                      1                            2    3                           Score

Appearance         1. Shriveled                         Plump and slightly moist    1.
Color              2. Pale or burned                    Well browned                2.
Moisture content   3. Dry                               Juicy                       3.
Tenderness         4. Tough                             Easily cut or pierced       4.
                                                        with fork
Taste and flavor   5. Flat or too highly seasoned       Well seasoned               5.
                   6. Raw, tasteless, or burned         Flavor developed            6.

FIG. 70. Score card for rating sample of cooking (Clara M. Brown et al., 1946).

Observations 

Observations or measures of active performance are needed when the
product is not an adequate index of a skill. The civil service typing test is
such a measure, indicating both speed and quality of performance. Some
tests use regular factory or shop equipment, while others use special
apparatus. Use of regular equipment is illustrated by a test of ability of
packers in a cannery. Production on the job could not be used as a test,
because of factors varying from day to day and because the job is normally
effected by teamwork. For test purposes, one conveyor belt was set aside and
one worker at a time assigned to it. A count was made of the number of cans he
packed per hour (Stead et al., 1940, p. 86).

Special equipment is used to obtain a worksample where regular
equipment cannot be used, because of either cost or danger. It is essential for
every submarine crewman to learn to use the escape hatch of his ship in case
it should sink in shallow water. The only sure test of ability is to have him
try to use it, but it is obviously impossible to make the test at sea. To test
(and to train) crewmen on shore, a replica of the escape hatch was built in
a deep tank. Since this test reproduces all essential features of sea
conditions, a valid worksample is obtained.

Motion pictures have occasionally been used to assess ability to observe.
Ability of aerial navigators was tested by showing them a motion picture,
taken from a plane, giving a view of the ground and of the essential
instruments. Aided by a map, students were to make a plot just as they would in
flight. This, like many worksamples, proved to have little reliability (Carter,
1947). Low reliability is characteristic of worksamples where one error may
disturb the entire sequence of performance, and several samples of
performance must therefore be obtained. The more successful motion-picture
tests usually include a large number of short, similar items, rather than a
few complex sequences of performance (Gibson, 1947).

Observations are far from trustworthy. Men with experience in administering
worksample tests were asked to record facts about the performance of a
man sharpening a drill point, as shown in a film, on two occasions one month
apart (Siegel, 1954). Even though the questions dealt with readily observable
facts (e.g., Did the man wear goggles while grinding?), the raters’ answers
on the second occasion agreed with their first answers only 82 percent
of the time (50 percent being chance expectancy).
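
One way to read the 82 percent figure against its 50 percent chance baseline
is to ask what share of the possible improvement over chance the raters
actually achieved. The sketch below is an illustration only, not a computation
reported by Siegel; the function name is hypothetical, and the ratio used is
the common chance-corrected agreement formula, (observed - chance) divided by
(1 - chance).

```python
def chance_corrected_agreement(observed, chance):
    """Share of the possible agreement beyond chance that was achieved.

    Both arguments are proportions between 0 and 1. A result of 0 means
    agreement no better than chance; 1 means perfect agreement.
    """
    if not 0 <= chance < 1:
        raise ValueError("chance agreement must be in [0, 1)")
    return (observed - chance) / (1 - chance)


# Figures from the text: 82 percent observed, 50 percent expected by chance.
value = chance_corrected_agreement(0.82, 0.50)
print(round(value, 2))  # about 0.64
```

On this reading the raters recovered roughly two-thirds of the agreement that
lay beyond chance, which makes the unreliability of simple observation easier
to appreciate than the raw 82 percent does.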

Evaluation of performance is improved by recording systematically what
the subject does. Mechanical recording devices are especially valuable
where a performance is rapid or where subtle details are important. An
example from industrial training is provided in Lindahl’s study of disk-cutter
operation (1945). The difference between good and poor performers was
found to lie in the speed with which they went through each phase of the
cycle of operation. The operation called for pressing a pedal to drive the
cutting wheel, and releasing it for a new cut. Lindahl devised the recording
device shown in Figure 71, which yielded records such as that in Figure 72.
These objective records showed which workers were the best producers and,
more important, what errors each was making. The records also provided a
means of teaching the worker what errors he was making and helping him
recognize the “feel” of the pedal when he was doing the act correctly.

30. Outline a plan for obtaining product ratings and performance observations 
for each of the following situations. In each case, discuss the relative merit of 
the two procedures 

a. Testing a boy's knowledge of how to wire batteries in series 

b. Testing the improvement in technique of a concert violinist 

c. Testing ability to operate a calculating machine for all types of operation.

STANDARDIZED PROFICIENCY TESTING IN SCHOOLS 

Standardized tests are a comparatively modern innovation. Fifty years ago
teachers around the country taught in their own ways, set their own
expectations of their pupils, and assigned grades independently. If the average
fourth-grader in Mill Corners could outread sixth-graders down the road at
Pinetown, that fact was never brought to light. Parents and employers
looked at the performance of school graduates and, according to their
dispositions, were pleased or displeased with the results. There was no sound
basis for judging whether the school had taught as well as might reasonably
be expected.

[Records not reproduced; panels labeled 9 Hours, 45 Hours, 141 Hours, 239 Hours]

FIG. 72. Improvement in the foot-action pattern of a trainee. The record at
the top shows long pauses between strokes, uneven speed during the cutting
(downstroke), and jerky foot action at the end of the stroke. All these faults
were eliminated in the final record (Lindahl, 1945).

The first systematic comparison of school attainment was made by an
educational crusader, J. M. Rice (1897). He was convinced that the pressure for
perfection in certain accomplishments was leading to faulty emphasis in
education, and he prepared a spelling test to collect evidence for an article
on the subject. His test, given in 21 scattered cities, showed that the test
scores of eighth-graders were about the same in all cities regardless of the
time devoted to spelling. Although children in some cities were superior
spellers during early grades, presumably because of stress on that subject,
such differences vanished by the end of schooling. Rice hoped to
convince teachers that they could reduce the time spent on formal skills, saving
more time for an enriched curriculum. Ironically, the testing movement
which he fathered tended instead to chain the schools to limited curricula
and to increase the emphasis on a few skills.

Educators were quickly impressed with the advantages of determining
whether schools were “up to standard,” and tests of reading and arithmetic
were prepared and widely used. Tests in other subjects followed. In some
cities, at the height of the enthusiasm for standard tests, every pupil was
given a nationally distributed examination each June in nearly every course
he studied. Despite the marked benefits conferred by tests, the testing craze
eventually produced serious dislocations in the school program.

The Navy program described at the start of the chapter demonstrated that
tests are a powerful instrument for administrative control of the classroom.
The tests show which teachers are bringing their groups “up to standard,” so
that administrators can take prompt remedial action. The fact that tests can
be used in this manner constitutes an obvious threat. Even if a teacher
knows that no one in his school system has ever been discharged or
reprimanded after his class made a poor showing, the desire to make a good
impression on his superiors will cause him to take the tests seriously. The
teacher relieves his anxiety by making a greater effort to teach effectively,
and by putting pressure on his pupils to work harder. This increase in effort
on both sides might increase the amount pupils learn, but it has frequently
raised tension in the classroom to an unhealthy level. It is one thing for a
teacher to demand thorough preparation and work of good quality; it is quite
another for the teacher to whip his charges on so they will “make the best
record in the school.”

Administratively imposed tests not only intensify the effort in the
classroom, they channel that effort until teaching can become entirely a matter
of preparing for the examination. The situation in New York State has been
described as follows (Brickman, 1946):





The New York State Education Department, better known as the
Regents, administers uniform examinations, also better known as the
Regents, semiannually to all high-school students pursuing key subjects.
To prepare their students for this ordeal, many teachers abandon the
regular textbook in favor of a special booklet containing a review of the
subject and a reprint of recent Regents’ examinations. The general
practice is to begin the review about four to six weeks in advance of the big
test, although some teachers start Regents preparation as early as the
first day of the term.

Such concentration may be reasonable if the test measures what the
pupils ought to be learning, but it severely restricts education when the test
covers the wrong outcomes or covers only a few of the desired outcomes.
The writer recalls visiting a rural school which was alarmed because all the
boys, upon finishing the compulsory eighth grade, left school to work on the
farm. The principal believed they should stay in high school, but the boys
considered school a waste of time. A look at the “literature book” for
Grade 8 supplied one clue to the difficulty. Pupils were being held to
selections about a Hindu boy and his village, mountain climbing in Tibet, and
other topics of remote interest. When the teacher was asked why she did
not encourage the boys to develop their language skills on bulletins from the
agriculture extension service that the boys would consider valuable, her
answer was: “I know this book isn’t good, and the boys don’t like it, but I
have to teach it because it prepares pupils on topics covered in the standard
test given at the end of the year by the County Office.”

A test (or set of tests) is said to have “curricular validity” if it represents
the objectives of the curriculum the pupils have studied. Instruction should
not be identical in all classrooms of a given grade. Even within the same
class, it may be proper for different pupils to work on different skills at a
given time. A standardized test necessarily fits one particular set of
objectives and one particular body of content. Uniform instructional aims may be
assumed in Navy training; every torpedoman must learn the same things no
matter what school trains him. In public schools there is much less
justification for uniform content. Everyone would agree that elementary-school
pupils should learn certain basic concepts about society and the community
(for example, interdependence of communities and nations). One school
might approach this by a survey of local industries. Another might
develop the same concept with a unit on Great Britain. Perhaps a school in
Texas would find pupils more responsive to a unit on South America. All
of these programs would aim toward the same goal, and yet their content is
so different that no one test fits all three approaches.

The same problem arises even in fundamental skills such as spelling and
arithmetic, where objectives are more definite. Some teachers develop
spelling incidentally to instruction in other subjects. Drill on long lists of
words, as Rice first pointed out, produces very small permanent gains. It is
reasonable to suppose that pupils will learn spelling just as well if they
master words they have occasion to use. But when the teacher knows that the
test will be a random sample of words from a “standard word list,” he cannot
hope to make a good showing by concentrating on words that pupils misspell
in writing about South America. One published spelling test, for example,
uses words such as anxious, foreign, vitamins, biscuit, admission, etc. The
only way to insure a high score on a test like this is to have a daily spelling
drill with a miscellaneous list of words. In arithmetic, all teachers cover the
same content, but there are wide differences in opinion regarding the
appropriate timing of a particular topic. Should fractions be introduced in the
third grade? If the test will include such items, the teacher is likely to try
to squeeze them in even though it would be wiser to put extra time on short
division. Conversely, if the test omits fractions, a teacher hesitates to spend
time on them even when a class is interested in fractions and ready for such
work.

Tests have effects on the pupils also. The pupil learns that “what really
matters” in any course is what shows up on the tests. Anything the teacher
introduces which will not be tested is likely to be regarded as a side show. A
mathematics teacher may try to show the similarity between geometric
postulates and the premises hidden in advertising appeals or political
debates. Students who know that their tests will cover mathematics alone may
be entertained by the digression, but they will not study the material. By the
time of college, the student is keenly alert to the fact that tests cover only
part of the course, and is sure to focus his study on what he thinks he will be
examined on.

Because of increased recognition of these problems, there was a swing
away from wholesale, administratively imposed testing in the years
following 1930. Many school testing programs became inadequate with respect to
both the amount of information collected and the use made of it. The
national concern with educational quality, brought to a peak by the successful
launching of a Russian satellite in 1957, revived public and professional
concern with tests as a means of quality control. President Eisenhower,
speaking shortly after the Russian success, suggested that a national examination
might be the best way of raising educational standards. The educational
legislation adopted by Congress in 1958 made special provision for state
testing programs, and a report prepared to guide such programs
(Identification and Guidance of Able Students, 1958) said that achievement tests are
“the most valuable single testing investment for the statewide program.”

The chief risk in using standard tests as a means of quality control is that
they will discourage teachers from introducing untraditional material or
trying new methods. The test is likely to focus attention on those outcomes
easiest to test, to the neglect of attitudes, originality, and complex ideas.
Choosing and using tests wisely can overcome these difficulties without
sacrificing the benefits that standardized testing can offer. There are four
questions to answer in planning such a program: When should the tests be used?
What should be tested? Which tests should be selected? How should the
results be used? Of these questions the last is paramount.

The proper function of a school test is to improve the educational program.
It may do so by helping plan learning experiences for a pupil, by indicating
ways to improve teaching, or by building attitudes in pupils and teachers
which will promote better teaching. Once this point of view is accepted, it
follows that tests are initial, not terminal, parts of the educative process.
There is little merit in testing after it is too late to profit from the results.
For this reason, more and more schools are using achievement survey tests at
the beginning of the school year. When the results of suitable tests are placed
in the hands of the teacher in September, they provide a sound basis for
planning the year’s work. There is no argument against testing again in June
to measure improvement, but in fall testing the emphasis is on diagnosis and
curriculum planning rather than on marking and recrimination.

In guidance, tests are used for the pupil rather than on him. They show
him his weaknesses, and are a more effective argument for his taking certain
courses or changing certain habits than is pressure from the teacher. In
guidance testing, it is important to minimize competition and concern over
the effect of tests on marks. In the most successful programs, the pupil takes
the tests because he wants to know the results.

It follows that the tests have to measure something of importance. Some
schools will seek to measure acquisition of subject matter. Others will be
more concerned with educational development defined less in terms of
specific knowledge and more in terms of skills such as interpretation of data.
In general, it appears that the most useful standardized tests are those which
cover highly general objectives rather than those covering specific content.
A test of ability to reason from a scientific principle to a conclusion about a
strange situation is a fair test for almost anyone. It is not necessary for the
student to have studied either the specific situation or the principle; if he
can think scientifically, he can draw the correct conclusion. The GED tests
calling for ability to interpret new reading selections likewise measure
proficiency regardless of what the person has studied.

Tests should not be the sole determiner of the pupil’s mark. Equal
attention should be given to locally constructed tests of objectives not covered
in the standard instruments, and to evidence the teacher has collected from
the pupil’s continued class performance.





It is important to interpret scores in the light of the background of the
pupils and of the school program. The fact that a school is “behind” the national
norms is no cause for alarm. The reasons for the lag need to be found, but
they may not justify any change in the school program. For example, a
school which takes in many pupils from Spanish-speaking families may quite
properly decide to spend most of its effort during the first two years on
developing English vocabulary, even if this delays instruction in arithmetic. A
fifth-grade class which has been enthusiastically composing original stories
and poems need not be criticized if grammar has been neglected in the
process. Such evidence would suggest extra effort on grammatical usage at
some other time, but would suggest changing the fifth-grade work only if the
teacher thought it possible to improve formal usage while at the same time
developing creative abilities.

Achievement testing has been detrimental when it forced schools into
training rather than educating pupils. So long as tests are considered in the
light of the pupil’s past development and as a guide to future instruction,
they need have no harmful results. They will have to be improved to meet
these new demands adequately. Tests of limited validity may serve tolerably
as impartial marking instruments. But when a test bears the responsibility
of describing what a pupil knows and can do, and what he needs to attain,
it will have to meet a high standard of validity. It is in this direction
that improvement is to be anticipated.

31. If a teacher knows that a test containing items such as the following will be
given at the end of the fifth grade, how will it influence her social studies
instruction? (Items from Stanford Achievement Test, copyright 1952, World
Book Co., and used by permission.)

1. A chief food of Eskimos is

fish    vegetables    fruits    cereals

2. A man who works with wood is a

plasterer    carpenter    plumber    painter

3. Each star in the United States flag stands for a

state    city    president    battleship

4. A large ranch in a mountainous area is most likely to sell

wool    milk    vegetables    chickens

5. The great pioneer leader in Kentucky was

Boone    Clark    Marion    Carson

6. The invention of the steam engine made possible the invention of the

reaper    locomotive    sewing machine    Bessemer converter

7. A popular amusement in ancient Rome was

soccer    chariot racing    cricket    golf

32. What effect on the high-school social studies curriculum would be expected
to follow if tests of ability to interpret data (charts, graphs, government
reports, etc.) were given annually to all pupils? Would this effect be beneficial
or harmful?

33. Most states require high-school students to study American history as a way
of developing their proficiency as citizens. Would it be beneficial or harmful
to give every pupil a test of historical information based on a random sample
of persons and dates in American history?

34. A writer says, "As a general rule, no achievement test printed or revised more 
than five years ago, or any other test more than ten years old, should be 
used.” Do you agree? 

35. Are there school subjects in which the content in a certain grade ought to be 
uniform for all schools? 


Suggested Readings 

Katz, Martin R. Selecting an achievement test: principles and procedures.
Princeton: Educational Testing Service, 1958.
This thirty-page brochure (available without charge from the publisher)
covers the major considerations in selecting tests for school purposes. In
addition to a review of reliability and validity as they apply to achievement
tests, the author considers school characteristics which affect the choice of
tests and gives advice on how scores should be interpreted.

Noll, Victor H. Objectives as the basis of all good measurement. Introduction to
educational measurement. Boston: Houghton Mifflin, 1957. Pp. 90-107.
This chapter, from a representative textbook dealing with problems of testing
in schools, describes and illustrates the process of stating educational
objectives and using them to direct test construction and test selection.

Travers, Robert M. W. The trend toward the measurement of skills. Educational
measurement. New York: Macmillan, 1955. Pp. 94-115.
Travers explains the reason for growing interest in intellectual skills as distinct
from mastery of facts, and describes tests used to measure thinking skills and
study skills.

Interest Inventories


WE NOW turn from the study of ability tests (tests of maximum performance)
to the assessment of typical behavior. We shall begin with interest
inventories, paying particular attention to selected inventories which
illustrate different techniques of measurement. With these concrete examples
before us, we shall discuss in Chapter 15 some general problems of obtaining
information on typical behavior.
Functionally, interest inventories are closely related to the aptitude tests
we have been considering in preceding chapters, since their main use is in
vocational and educational guidance. An interest “test” is a lengthy
questionnaire. It applies the “self-report” technique referred to in Chapter 2,
obtaining information by having the individual describe his own characteristics.
The questionnaire or inventory may be regarded as a written interview
which, since it uses numerous rather indirect questions, is in some ways
more satisfactory than the direct oral interview. A single direct question,
“Would you like to be a teacher?” does not give adequate information for
guidance because answers may be based on ignorance or superficial
understanding of the vocation. A girl may reject teaching for no better reason than
that she thinks correcting papers would be tedious, little realizing the
numerous other activities in a teacher’s day. Likewise, some boys choose law
because it calls for public speaking, ignoring its long hours of isolated research
and thinking. To get around such difficulties, the blunt question is replaced
by the indirect, comprehensive, objectively scored inventory.
An important advantage of the standardized inventory over the interview
is the possibility of comparing responses to those of reference groups. A
student may indicate that he likes 25 computational activities out of 80 such
activities listed in a particular questionnaire. This, on its face, appears not
to indicate much liking for computational work. But since our culture views
computation more often as work than as fun, this raw score of 25 places the
student near the 80th percentile for high-school boys. Though he may not be
strongly attracted to computation, he evidently finds it much less distasteful
than most boys do. He is a much-better-than-average prospect for a vocation
which combines computational duties with duties in which he would have a
positive interest.
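
The comparison of a raw score with a reference group can be sketched in a few lines of Python. This is only an illustration: the norm scores below are invented for the example and are not taken from any published inventory, and the convention of counting half of the tied scores is one common choice assumed here.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(raw_score, reference_scores):
    """Percentage of the reference group falling below raw_score.

    Ties are counted as half below (an assumed convention).
    """
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, raw_score)
    ties = bisect_right(ordered, raw_score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical norm group: number of computational activities liked (out of 80)
hypothetical_norms = [5, 8, 10, 12, 14, 15, 18, 20, 25, 30]

rank = percentile_rank(25, hypothetical_norms)
print(f"A raw score of 25 falls at percentile {rank:.0f} of this invented group")
```

The point of the sketch is that the same raw score can look low in absolute terms (25 of 80) and still stand high relative to the group with which the student is compared.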


THREE APPROACHES TO INVENTORY CONSTRUCTION 
Empirical Keying: The Strong Blank 

One of the two most widely used tests of interests is the Strong Vocational
Interest Blank (SVIB), first published in 1927 (Strong, 1943). This inventory
has been developed by means of a strictly empirical procedure; that is
to say, it makes very few psychological assumptions and develops scoring
formulas entirely on the basis of observed correlations of responses with
criteria. The Kuder Preference Record, its chief competitor, describes the
individual in terms of psychological traits (e.g., mechanical interests). The
Strong inventory is comparable to the aptitude test designed for a particular
occupation, where trial-and-error selection of items maximizes predictive
power and theory of abilities plays little part in the test construction. The
Kuder inventory is more comparable to multiscore aptitude tests intended to
describe distinct aspects of ability; the practical implications of such
abilities remain to be established after the test is developed.

The SVIB consists of questions on hundreds of activities both vocational
and avocational. Most of the 400 items require a “like-indifferent-dislike”
response to activities or topics: biology, fishing, being an aviator, planning
a sales campaign, etc. Strong tried to select activities that adolescents would
know or be able to imagine, rather than activities that become meaningful
only as a result of work experience.

Assignment of Item Weights. Since the majority of men in a particular
occupation have roughly similar interests, Strong assumes that a person having
the pattern typical of an occupational group will find satisfaction in that
field. Strong identifies, for example, the interests characteristic of practicing
engineers. College students who have the same interests are advised to
consider engineering as a vocation, and students with some other interest
pattern are warned that they may not enjoy the work of an engineer.

The questionnaire was given to successful members of a particular profession,
and the interest pattern for that profession was determined by comparing
the responses of the group with those of men of similar age selected
randomly from the whole range of occupations ordinarily entered by college
men. A weighted scoring key was prepared to assess how closely the subject’s
interests correspond to those of the professional group. Table 55 illustrates
the plan by which the key was constructed. On each item, the
percentage of men-in-general giving each answer was compared with the
percentage of men-in-the-occupation giving the answer. Engineers dislike
“Actor” more commonly than other men; therefore, response D is assigned a
positive weight in the Engineer scale. The weighting is proportional to the


TABLE 55. Determination of Weights for Strong's Engineer Key

First 10 Items on           Percentage of    Percentage of    Differences in      Scoring Weights
Vocational Interest Blank   “Men-in-General” Engineers        Percentage Between  for Engineering
                            Tested           Tested           Engineers and       Interest
                                                              Men-in-General
                            L    I    D      L    I    D      L     I     D       L    I    D

Actor (not movie)           21   32   47     9    31   60     -12   -1    +13     -1   0    1
Advertiser                  33   38   29     14   37   49     -19   -1    +20     -2   0    2
Architect                   37   40   23     58   32   10     +21   -8    -13     2    -1   -1
Army officer                22   29   49     31   33   36     +9    +4    -13     1    0    -1
Artist                      24   40   36     28   39   33     +4    -1    -3      0    0    0
Astronomer                  26   44   30     38   44   18     +12   0     -12     1    0    -1
Athletic director           26   41   33     15   51   34     -11   +10   +1      -1   1    0
Auctioneer                  8    27   65     1    16   83     -7    -11   +18     -1   -1   2
Author of novel             32   38   30     22   44   34     -10   +6    +4      -1   1    0
Author of technical book    31   41   28     59   32   9      +28   -9    -19     3    -1   -2

Source: Strong, 1943, p. 75.


difference. Liking to be the author of a technical book is especially common
among engineers; since it is a significant indicator of engineering interests, it
is given a weight of +3. In contrast to engineers, who tend to dislike acting,
40 percent of artists respond “Like” to “Actor.” The weights of “Actor” in
the Artist scale are +2 for L, 0 for I, and -1 for D.
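
The weighting plan can be illustrated with a short Python sketch. The rule used here, one weight unit for each ten percentage points of difference, is an assumption chosen because it reproduces the weights in the rows of Table 55 quoted below; it should not be taken as Strong's actual formula. The function names and the two-item key are hypothetical and serve only the illustration.

```python
import math


def scoring_weight(diff, points_per_unit=10):
    """Turn a percentage difference into an integer weight.

    Assumed rule: one unit per 10 percentage points, rounded to the
    nearest whole unit, keeping the sign of the difference.
    """
    magnitude = math.floor(abs(diff) / points_per_unit + 0.5)
    if magnitude == 0:
        return 0
    return int(math.copysign(magnitude, diff))


def item_weights(general_pct, occupation_pct):
    """Weights for the L, I, D responses to one item."""
    return tuple(scoring_weight(o - g) for g, o in zip(general_pct, occupation_pct))


# Two rows of Table 55: percentages answering L, I, D
actor = item_weights((21, 32, 47), (9, 31, 60))
tech_book = item_weights((31, 41, 28), (59, 32, 9))


def scale_score(responses, key):
    """Sum the weights attached to the responses a person actually gave."""
    return sum(key[item]["LID".index(answer)] for item, answer in responses.items())


engineer_key = {"Actor": actor, "Author of technical book": tech_book}
example_score = scale_score(
    {"Actor": "D", "Author of technical book": "L"}, engineer_key
)
print(actor, tech_book, example_score)
```

A person who dislikes "Actor" and likes "Author of technical book" gains credit on both items, which is how the key accumulates evidence of resemblance to the occupational group.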

Occupational scores are converted into letter grades ranging from A to
C. Seventy percent of successful men in the occupation fall into the A group
on that scale. The interests of a person who falls below B+ are quite
different from those of the bulk of the occupational group. Only 2 percent of
the men in the occupation fall as low as C.

Strong’s key is based on no psychological theory about engineers; he relies
entirely on test data to define what engineers are like. Some of the weights,
such as +2 for liking to be an architect, fit our expectations. Other weights
may seem quite “unreasonable.” Liking to write a novel lowers the Engineer
score and disliking such work counts zero, but being indifferent counts
+1. A few weights are illogical because they come entirely out of the
numerical findings (some of which are chance effects) and are not influenced
by the author’s judgments.

The empirical key is a more or less heterogeneous mixture. In Table 55,
the ten responses weighted for the Engineer key encompass interest in
mathematical-scientific subjects (Architect, Astronomer, Author of technical
book), dislike for verbal activities (Actor, Advertiser, Author of novel,
Auctioneer), indifference to Athletic director, and liking for Army officer.
The remainder of the key, to give only a few examples, puts substantial
weight on the following likes: calculus, chemistry, National Geographic,
repairing a clock, writing reports, and improving the design of a machine.
These items all reflect scientific-technical interests. Small weights are given
to numerous miscellaneous likes not obviously related to engineering: taking
long walks, symphony concerts, military drill, talkative people, courteous
treatment from superiors. Since these scattered items have much less
influence on the score than the many highly correlated technical items, they
can be neglected in psychological interpretation of the Engineer score.

Keys for as many as 47 male occupations (Printer, Musician, etc.) are
available. There is also a women’s blank which can be scored for Nurse,
Stenographer, Dentist, and 24 other occupations. The items of Strong
inventories, like those of the biographical inventories mentioned in Chapter 12, are
so varied that they can be used to predict almost anything. A new key can be
made for any vocation or specialized group. For example, Strong originally
provided separate keys for accountants, office workers including
bookkeepers, and certified public accountants. A later study of nearly 3000
practicing accountants, however, found that only about 40 percent of certain
CPA subgroups make a score of A on CPA, whereas 70 percent of men in an
occupation are expected to make A. Strong therefore prepared a new key for
“Senior CPA.” The original scale seems to apply well to partners managing
accounting firms and was renamed the “CPA Partner” scale. The “Accountant”
scale seems to apply to junior accountants, and probably also to men
who move from accounting into business management. The “Partner” scale
stresses verbal interests; the new “Senior” scale involves mathematical
interests and has a negative relation to such verbal interests as Lawyer and
Advertiser. Senior CPA and Partner CPA scores correlate only .07 (Strong,
1949).

Strong keys are not confined to vocational interests. By scoring answers
which men give more frequently than women, for example, a
“masculinity-femininity key” was prepared. In principle, the test could also be
keyed to give an indirect measure of scholastic aptitude, of neurotic tendency, or
even of soundness of financial credit.

When the SVIB was first produced, calculating weighted scores on the
many keys was extremely laborious. Fortunately, several methods of efficient
scoring have now been developed, and most guidance services arrange
to send the tests to centers where they can be processed electronically.

1. Estimate the approximate weights for the Chemist scale, of the item “Actor,” if
responses of chemists are as follows: 16 percent L, 34 I, 50 D.

2. Estimate the weights for the Musician scale, of “Actor,” if responses of musicians
are 34 percent L, 48 I, 18 D.

3. Suppose you wished to make a key for the Strong blank for women to measure
interest in being a mother, i.e., to predict whether a girl will enjoy raising a
family. Outline the steps you would follow to prepare the scale, with special
attention to the persons you would use as a basis for the key.

4. Each of the following assumptions is implied in the construction or in some uses
of the SVIB. For each one, state a contradictory hypothesis that might be
reasonable.

a. One is not likely to succeed in an occupation unless the work is interesting
to him.

b. One is not likely to succeed in an occupation unless his interests are similar
to those of most other men in the profession.

c. Interest in the school subjects required for preparation for a profession is not
an adequate basis for predicting satisfaction in the profession.

d. The interests leading to satisfaction in a vocation in 1930 will also be
associated with satisfaction in 1970.

5. Research psychologists generally find considerable mathematical work neces¬ 
sary, yet liking for mathematics is assigned a weight of zero in the Strong scale 
for psychologists. How can this seeming inconsistency be explained? 

6. “An A rating in psychologist with B+ in physician and dentist should suggest a
different preparation and career than an A rating in psychologist with B+
rating in engineer, production manager, and carpenter” (Strong, 1943, p. 54).
What differences in advice are justified in these cases?

7. How might an interest test be used to distinguish, among prospective teachers, 
those likely to be traditional subject-matter teachers from those likely to em¬ 
phasize the development of the pupil as a person? Outline a plan for research 
to develop such a procedure. 

8. Kuder's Occupational inventory (not to be confused with his Vocational
inventory) is scored empirically by weighting items in a manner similar to Strong’s.
Kuder mentions the following principles used in developing his scale. Comment
on the reasonableness of each principle.

a. The vocabulary should be kept simple.

b. To keep obvious vocational significance of the item to a minimum, items
should not consist of occupational titles to be checked as liked or disliked.

c. It is generally more important to sample a large number of relevant areas 
than to obtain large samples of only a few areas. 

d. When the purpose of a test is to differentiate between groups, reliability 
within groups (e g., within the group of engineers) is relatively unimportant. 

9. Would Strong improve his Engineer key by discarding weights which do not 
seem logical even though the item in question shows a difference between Engi¬ 
neers and men in general? 

Interest Clusters. Although Strong’s original purpose was to make predictions
about suitability for specific occupations, his test is used equally often
to obtain a general description of the person being counseled. Such a
description must organize the responses in terms of psychologically meaningful
traits. Factor analysis of the vocational keys has produced a set of descriptive
traits for the SVIB.

The analysis indicates the following clusters of occupational interest for
men:

Group I, Creative-scientific: Artist, psychologist, architect, physician,
dentist.

Group II, Technical: Mathematician, physicist, engineer, chemist.

Group III: Production manager.

Group IV, Sub-professional technical: Farmer, carpenter, printer,
mathematics-science teacher, policeman, forest service.

Group V, Uplift: YMCA physical director, personnel manager, YMCA
secretary, social science teacher, school superintendent, minister.

Group VI: Musician.

Group VII: Certified public accountant.

Group VIII, Business detail: Accountant, office man, purchasing agent,
banker.

Group IX, Business contact: Sales manager, real-estate salesman, life
insurance salesman.

Group X, Verbal: Advertising man, lawyer, author-journalist.

Group XI: President of manufacturing corporation.

Special keys for those groups involving several occupations have been
prepared, so that the counselor can score the blank on these eleven factors and
thus arrive at a meaningful overall description. The counselor using hand
scoring may find it efficient to score the blank for the occupational groups as
a first stage in counseling, and then to apply specific occupational keys only
for occupations which seem important after discussion of the group-key
profile with the subject. Such two-stage scoring is inefficient, however, when
the blanks are processed electronically.

Darley and Hagenah (1955, p. 34) advise against the use of group keys,
warning that a student may not have a high score on a group key within
which some of his high occupational scores lie. Physicists, to be sure, will
definitely be identified by the Group II key (Technical), where 97 percent
of them earn A’s. Among engineers, only 62 percent earn A’s in Group II.
Since only 70 percent of engineers earn A’s in the Engineer key, this does
not appear to be a damaging criticism of the use of group keys. The writer
recommends that counselors use group keys in hand scoring, but that they
give attention to groups where the client scores B+, as well as those where
he scores A.

Strong has designed a map for plotting the families of occupations. The
chart represents the surface of a globe. The record shown in Figure 73 has
high scores running from Group V over the “North Pole” to Group X. This
pattern of interests was shown by a psychology major tested in his senior
year. His interests emphasize “verbal” and “uplift” activities, with lower
scores on accounting and selling.

FIG. 73. Interest chart for the Strong blank. (Chart copyright 1945, Stanford
University Press, and reproduced by permission.)

10. If the student represented in Figure 73 has done good academic work and has 
been satisfied in his courses in psychology, what possible vocational aims are 
suggested by the test? 

11. There are many groups in which this student has low scores. Which of these 
lacks of interest would be significant in deciding against certain positions in 
psychology? 

12. The three dimensions (up-down, left-right, front-back) of the chart represent
the three chief interest factors in the SVIB scores. How might these factors be
named?

13. How would one use the SVIB in counseling a boy who is considering becoming 
a librarian or an English teacher, since there are no keys for those occu¬ 
pations? 

Homogeneous Keying: The Kuder Preference Record 

The evolution of Kuder’s inventory was almost exactly opposite to that of
Strong’s. Kuder began with a factor analysis of single items in order to
identify clusters of interests, and then organized these items into descriptive
scales. The scales were used in educational and vocational guidance even
though predictions rested on inference rather than evidence of predictive
validity. With the passage of time, information on the predictive validity of
Kuder profiles has been collected. Today scores for specific occupations can
be constructed from the Kuder profile just as for the Strong, although the
instrument is still used most often as a trait description.

We shall discuss primarily Form C of the Kuder Preference Record.
Form A, also in current use, is a personality test (see p. 496). Form B is an
early version of the vocational inventory, now replaced by Form C, and
Form D is a recently developed set of questions designed to yield specific
occupational scores like those of the SVIB and not intended for description.
Thus Form C best illustrates the development of descriptive keys.

Kuder identified ten clusters of occupational interests, a cluster being a
group of items which have substantial correlations with each other. Such
a group is said to be homogeneous, i.e., there is a common factor running
through the items. The ten scores constituting the Kuder profile are Outdoor,
Mechanical, Computational, Scientific, Persuasive, Artistic, Literary,
Musical, Social Service, and Clerical.

Each item is in the “forced-choice” form. Three activities are listed, for
example:


a. Develop new varieties of flowers.
b. Conduct advertising campaign for florists.
c. Take telephone orders in a florist shop.

The subject is to select the one he likes most and the one he likes least,
leaving the third unmarked. A person who chooses “a” as most liked receives
credit under Scientific and Artistic; choice of “b” scores as Persuasive, and
choice of “c” is counted as Clerical. These scorings are not arbitrary; the
items are counted in that key whose other items they correlate with.
Judgment entered the test construction only when Kuder decided what items to
use in his original tryout.
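
The crediting of a "most liked" choice to one or more scales can be sketched as follows. The key covers only the single triad quoted above, and the triad label "florist" is a hypothetical name introduced for the example. How "least liked" choices enter the score is not described in the text, so the sketch leaves them out.

```python
from collections import Counter

# Scale credits for the "most liked" choice in the one triad quoted in the text
FLORIST_TRIAD_KEY = {
    "a": ("Scientific", "Artistic"),
    "b": ("Persuasive",),
    "c": ("Clerical",),
}


def score_most_liked(most_liked, keys):
    """Tally scale credits over all triads from the 'most liked' choices.

    most_liked maps a triad label to the chosen letter; keys maps the same
    labels to a scoring key like FLORIST_TRIAD_KEY.
    """
    tally = Counter()
    for triad, choice in most_liked.items():
        tally.update(keys[triad][choice])
    return tally


profile = score_most_liked({"florist": "a"}, {"florist": FLORIST_TRIAD_KEY})
print(dict(profile))
```

Because one choice may count toward two scales, the scale totals are not independent tallies of separate items; that is a consequence of keying each response wherever it correlates.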

The occupational interpretation is usually made by identifying the two
highest scores in the profile and referring to a list of occupations for which
those scores are believed or known to be relevant. According to the test
manual, a “3-6” profile (i.e., one with highest scores in categories 3,
Scientific, and 6, Literary) suggests the occupations author, editor, reporter,
physician, surgeon, psychologist, and etymologist.

Kuder scores are most often interpreted on the basis of their “common-sense”
meanings. A person like Mary Thomas whose profile (Figure 74)
shows high Clerical and Computational scores presumably will enjoy
positions demanding such activities. The low-interest areas are also important,
since the person might dislike work demanding such activity. In Mary
Thomas’ case, the interest test was highly informative. She was majoring in
child development in college at the time she took the test. Her grades were
mediocre, and her work with children was not especially successful. When
questioned regarding her choice of major, she explained that she had set
her heart on work in an orphanage. This desire had arisen in childhood
when she read a book about a woman who helped orphan children, and this
had seemed to her a “wonderful” thing to do as a lifework. The low Kuder
scores in Persuasive and Social Service activities suggested a somewhat
withdrawn personality, while the high Mechanical, Computational, and Clerical
scores suggested a liking for routine, uncreative activities. When questioned
about office work, she enthusiastically described her previous summer’s
work as a file clerk; her duties apparently consisted solely of alphabetizing
folders, yet she had “just loved it.” Moreover, she had done well in
secretarial training courses. Evidently both ability and interest fell in an area she
had not considered as a vocational goal.

FIG. 74. Kuder profile of Mary Thomas. The Profile Sheet for Women of the
Kuder Preference Record. (Adapted by permission of Science Research
Associates, publisher.)

14. What tentative conclusions can be drawn about a college man with the
following percentile scores: Outdoor, 60; Mechanical, 50; Computational, 30;
Scientific, 70; Persuasive, 98; Artistic, 70; Literary, 90; Musical, 50; Social
Service, 40; Clerical, 15?

15. A boy majoring in business administration shows high interests in Persuasive
and Social Service. He is near average in Clerical and Computational. He has
a high score (78th percentile) in Scientific, which is not usual among business
managers. What do these findings suggest?

16. A person's absolute interests are indicated by the proportion of the possible
preferences he marks in a given category. A particular man earns 40 percent
of the highest possible score in the Literary category; this places him at the
80th percentile. He earns 50 percent of the possible points in Mechanical;
this places him at the 59th percentile in that score because the average man
has a higher percentage score in Mechanical than in Literary. In what sense is
it correct to say that this man has a higher interest in mechanical than in
literary activities? Should guidance be based on relative or absolute scores?

17. What high and low points would you expect to find in the Kuder profiles of
these groups? Compare your answers with the results in Kuder's manual.

a. Female personnel managers.

b. Female retail buyers.

c. Female secretaries.

d. Male manufacturing foremen.

e. Male photographers.

Occupational Keys. When Kuder published the first form of his instrument
in 1940, he suggested interpretation solely on logical bases. This required
assumptions about the interests relevant to various occupations, and studies of
occupational groups were needed to validate those assumptions. Kuder’s
data, gathered more or less as opportunity has permitted, do not necessarily
represent adequate samples for the various professions. Information on
psychologists, for example, was obtained by asking 260 Fellows in four
Divisions of the American Psychological Association to fill out the scale; 111
persons (46 percent) provided data (Baas, 1950). Separate tabulations are
made for the 27 clinical psychologists, 26 counseling psychologists, 29
theoretical psychologists, and 29 industrial psychologists. For all psychologists
combined, the median fell at the 84th percentile (of men in general) in
Scientific and Literary, between 60 and 70 in Computational and Social
Service, and at or below 30 in Clerical, Mechanical, and Persuasive. The
only noticeable differences between subgroups were in Artistic (Theoretical
and Clinical above 60, others below 40) and Social Service (Clinical and
Consulting above 70, others below 50).

The results found by Kuder generally support the logical expectations.
The median profile for accountants shows peaks in Computational and
Clerical. Authors, editors, and reporters have a peak in Literary, chemists in
Scientific, musicians in Musical. At the same time, there are enough departures
from expectation to demonstrate that logical presuppositions must be tested
(D. N. Wiener, 1951). The median for Engineers, for example, is 64 in
Mechanical; 68, Computational; 73, Scientific. These depart from 50 in the
expected direction, but not very far, and many engineers are below average in
one or all of these scores. Camp counselors of the YMCA might be expected
to have distinctly high Social Service scores, but they average only at the
69th percentile, being equally high in Persuasive, Musical, Artistic, and
Literary.

Regression equations may be used to combine interest scores into a composite
which distinguishes men in an occupation from men in general. The
best simple formula for identifying carpenter interests counts Mechanical
positively and gives equal negative weights to Scientific, Literary, and
Clerical. A formula of this type was highly effective in separating carpenters from
men in general and from men in other trades (Mugaas and Hester, 1952).
Kuder profiles can therefore be transformed into occupational scores like
those for the SVIB. In practice, such translation is uncommon, because
counselors are more interested in helping students toward a general
self-understanding than in selecting particular occupations for them.
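
A composite of the kind described for carpenters might be sketched as below. Only the form of the formula comes from the text (Mechanical counted positively, equal negative weights on Scientific, Literary, and Clerical); the size of the negative weight and the two profiles are hypothetical values chosen for illustration, not figures from the Mugaas and Hester study.

```python
def carpenter_composite(profile, negative_weight=1.0):
    """Mechanical minus an equally weighted sum of three other scales.

    negative_weight is an assumed value; the source does not report it.
    """
    negatives = ("Scientific", "Literary", "Clerical")
    return profile["Mechanical"] - negative_weight * sum(profile[s] for s in negatives)


# Two invented percentile profiles
carpenter_like = {"Mechanical": 80, "Scientific": 30, "Literary": 20, "Clerical": 25}
clerk_like = {"Mechanical": 50, "Scientific": 70, "Literary": 60, "Clerical": 40}

print(carpenter_composite(carpenter_like), carpenter_composite(clerk_like))
```

The higher the composite, the more closely the profile resembles the carpenter pattern; a cutting score on the composite would then separate the groups.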

Logical Keying: The Lee-Thorpe Inventory

There is still a third approach to questionnaire construction. The
Occupational Interest Inventory by Lee and Thorpe is a set of questions selected
and organized on the basis of judgment rather than on statistical grounds.
This, which we may refer to as a “logical” approach, contrasts with the
Strong and Kuder procedures, which depend primarily on statistical
findings. The logical approach is similar to the technique of constructing
proficiency tests by defining a universe of situations and selecting items
randomly from that universe.

Lee and Thorpe took as their starting point the description of occupations
given in the Dictionary of Occupational Titles. This handbook prepared
by the USES defines and classifies virtually all American occupations.
Within each of six areas, tasks were selected to represent high, medium, and
low levels of responsibility. Brief job descriptions are presented in pairs; the
subject indicates which of the pair he prefers. Scores indicate the relative
frequency of choices in the Personal-Social, Natural, Mechanical, Business,
Arts, and Sciences categories.

This inventory differs from the SVIB in that the original selection and
classification of items is based entirely on job descriptions rather than on
empirical evidence that persons in the job like the activity. It differs from
the Kuder in that the first grouping of items came from a logical analysis
rather than statistical isolation of factors. The Mechanical category, for
instance, includes a great variety of tasks: labeling bottles, operating a lathe,
making drawings with ruler and compass, repairing shoes, operating an
elevator, testing the strength of steel structures, designing airplanes, etc. Such
a heterogeneous category is difficult to interpret either descriptively or
predictively. Knowing that a person likes half of the activities in such a
mixed group tells us little about what jobs he will find satisfying. Although
the original classification of items was based completely on logical criteria,
a subsequent correlational analysis was made to improve homogeneity.
Items which had nothing in common with the rest of the category were
eliminated in revising the test.


Relations Between the Inventories 

Initially, the three inventories were designed by quite different
techniques. Strong starts with almost no theory and searches out those interests
which go with membership in an occupation. Relations need be neither
logical nor psychologically interpretable. Lee and Thorpe begin with the
occupational category, basing items on duties which fall within the
occupational field. If Strong finds that engineers like mountain climbing, he uses
that item; Lee and Thorpe, however, would find such an item irrelevant to
the occupational description. If engineers have to check blueprints, on the
other hand, Lee and Thorpe would presumably include that in the
Mechanical score even if engineers dislike the task. Kuder ignores occupational
structure at the outset; instead he searches for a set of traits which
summarize the main differences between persons. Only after interest factors are
identified does he turn to logical analysis of occupations and then to
vocational interpretations based on statistical evidence.

Despite the differences in initial conception, we find that the three
inventories have converged on much the same sort of measurement. Through
factor analysis of Strong keys and through case studies, it has become
possible to translate SVIB occupational scores into a trait description. Through
use of regression equations and profiles of occupational groups, Kuder
scores can identify occupational categories into which people fit. Lee and
Thorpe have begun to purify their scales to increase interpretability and if
they desired could collect occupational norms for the test. In principle,
therefore, all tests can fulfill all functions. Each has its own characteristics, but
research gives no definitive answer as to which approach is best.

The inventories measure approximately the same interests and the
corresponding keys have substantial overlap, as can be seen in Table 56. This


TABLE 56. Selected Correlations Between Strong and Kuder Scores

Kuder scales (columns): Artistic, Scientific, Mechanical, Computational,
Social Service, Clerical, Persuasive, Literary

SVIB group keys          Entries in column order (blank cells omitted)

Creative-scientific      .45   .34   *    *
Scientific-technical     .67   .58   .14  *    *
Uplift                   *     *     .39  .30
Business detail          *     .50   *    .60  .36
Business contact         *     *     *    *    .29  .70
Verbal                   *     *     *    .48

Asterisks indicate substantial negative relationships. The remaining
correlations are between +.30 and -.30.

Source: Cottle, 1950b; see also Triggs, 1943.


table gives correlations for selected scales of the Strong and Kuder
instruments. Most of the correlations are in the neighborhood of .50 to .70 for
closely corresponding scales. The corresponding scores for any two inventories,
however, involve a substantial amount of independent content.

18. A college student states that he is interested in a career in the diplomatic serv¬ 
ice. This may be a response to the glamor of the field rather than a genuine 
interest. Which inventory would give the most relevant information for guiding 
him? 


VALIDITY OF INTEREST MEASURES 
Stability of Interests 

The first assumption in using interest tests for counseling is that they
measure a stable characteristic. Evidence of stability is not enough to
establish validity, but it is a necessary first consideration.

Strong, in his extensive follow-up studies, finds that interest scores are
indeed stable after age 17. When Stanford students were retested after an
eighteen-year interval, the two measures of interest were substantially
correlated: .76 on the Physician scale, .54 on Personnel Manager, .68 on Sales
Manager, .73 on Lawyer, etc. (Strong, 1955, p. 63; Darley and Haganah,
1955, pp. 37, 53). Scores in high school are less stable. Between tenth and
twelfth grades, the average correlation was .57 (Canning et al., 1941).

Another way of examining stability is to compare the test and retest
profiles of each individual to see whether the same scores remain high. In
Strong’s eighteen-year follow-up, only about 6 percent of all A ratings
changed to C, and 3 percent of C’s to A. For 17 percent of the cases, the
correlation between the test and retest profiles was .90 or higher. Fifty percent of
the cases had profile correlations of at least .80, which means that the two
tests, eighteen years apart, would lead to essentially the same
occupational advice. There are a few exceptional cases whose profiles on the two
occasions were markedly dissimilar. In eight cases, there were negative
correlations (ranging as low as -.47) between the two profiles, indicating that
the peaks and valleys of the profile had actually been reversed in the
interval (Strong, 1955, p. 64; see also Darley and Haganah, 1955, p. 43).
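
The profile comparison described above amounts to correlating one person's scores across scales on two occasions. A minimal sketch follows; the six scale scores are invented for illustration and are not taken from Strong's data, and the helper name `pearson` is likewise an assumption of this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical standard scores on six occupational keys for one person.
test_profile = [52, 31, 44, 25, 38, 47]
stable_retest = [50, 33, 45, 27, 36, 49]    # same peaks and valleys
reversed_retest = [28, 46, 33, 51, 40, 30]  # peaks and valleys exchanged

print(round(pearson(test_profile, stable_retest), 2))    # close to +1
print(round(pearson(test_profile, reversed_retest), 2))  # negative
```

A profile correlation of this kind is computed within one person across scales, so it describes the shape of the profile and says nothing about whether the overall level of the scores has shifted.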

Since the Kuder inventory was developed more recently than the SVIB,
there have been fewer studies of its stability. One study (Herzberg and
Bouton, 1954) covering changes from age 17 to 21 in a group who went to
college finds correlations ranging from .50 to .75 for the various scales.
Mechanical was especially stable for boys, but otherwise no important
differences between scales were found. Mallinson and Crumrine (1952) warn
against making long-range decisions from Kuder scores in Grade 9. On
retests in Grade 12, they found considerable stability, but many patterns
showed important changes. For only 74 percent of the pupils did the two
highest interests in Grade 9 remain among the three highest in Grade 12.
And for only 76 percent did the lowest Grade 9 interest remain among the
lowest three in Grade 12.

Interests gradually crystallize as the individual begins to discover himself
and to pile up rewarding experiences in a few fields. Individual differences
in interests are evident even in the preschool years. Little has been done to
investigate childhood interests save for tabulations of typical interests at
various ages. Leona Tyler (1951, 1955), however, measures individual interests
during the elementary grades.

In junior high school the pupil begins to form a mature picture of adult
occupations and of the branches of knowledge. He also begins to form a
conscious picture of how he differs from other persons. His experiences in
science, shop, and other courses give him opportunity to learn what he likes,
but his acquaintance with these areas remains superficial. As he pursues
more specialized high-school courses, reads more adult magazines, and
indulges in hobbies, his interests become more definite. The boy who likes
sports spends more time in them, increases his athletic proficiency, makes
friends having the same interests, and so builds a little world in which he
receives social reinforcement for this interest. Since high-school activities are
intended to develop interests, vocational guidance prior to the senior year
should be aimed chiefly to point out areas for exploration. A student who
enters high school with greater-than-average liking for persuasive activities
should enroll in courses and activities which will test and clarify that
interest, but he should certainly not commit himself to become a lawyer or
salesman.

Vocational choice is a continuous process. Certain doors are closed to a
person who does not get the right training at the right time, but whatever
training he does get leaves him with many options. Opportunities to
specialize, to turn to administration, or to learn new skills are constantly before the
professional man and white-collar worker. Long after the skilled worker
leaves school, he likewise continues to make vocational decisions: to change
jobs, to acquire a new skill, or to go into business for himself. The worker in
a routine factory job has less chance for vocational choice; change, when it
comes, is forced on him by a layoff, and his interests have little to do with
the choice among openings available to him. During the years from 40 to
60 there may be little change in vocational responsibilities (research on this
period is virtually nonexistent), but the approach of retirement brings
further demands for self-knowledge and choice of activities.

Interests never become permanently fixed. Broad lines of interest remain
unchanged for many years, but there is a constant reshaping of the detailed
pattern. The beginning psychologist may enjoy everything about the job:
testing, directing an experimental laboratory, analysis of data, vocational
advising, giving talks on occupations. After a while he finds his greatest
satisfaction in advising and begins to leave the other tasks to someone else
whenever he can. His specialization may become even narrower, if he finds
that his greatest rewards come from dealing with students whose
vocational questions are part of a broader emotional conflict between a student
and his parents. No label really tells what a person entering a professional
field will do. Interests are quite stable enough to help a student choose
between broad lines of training, but in that training and subsequent
experience he will modify his interests and make further career decisions. If jobs
were highly stereotyped, so that every 25-year-old worker in the occupation
did the same thing and kept on doing the same thing until retirement,
interests would need to be far more stable than they are to insure continued
satisfaction with one’s work.

Prediction of Vocational Criteria 

Overlap with Claimed Interests. Does the interest inventory give more
information than could be obtained merely by asking the person what fields of
work he thinks he would like? The evidence on this point shows that
inventories fulfill an important function. Students were asked to estimate their
own Kuder profiles, i.e., to report the strength of their various interests. The
average correlation between estimated interest and measured interest was
.52 (Crosby and Winsor, 1941). Another investigation compared claimed
vocational interests with interests measured by the SVIB (Haganah, 1953,
cited in Darley and Haganah, 1955, p. 67). Roughly two-thirds of those
with claimed interests in business detail, business contact, and technical
fields had similar measured interests, but the test supported the statements
of only about one-third of those who claimed a primary interest in scientific,
social service, and verbal-linguistic fields. At least two factors account for
such disagreements between scores and claimed interests. The use of a large
number of items, many of them indirectly related to the job in question,
provides a thorough sampling of interests which is more reliable and more
penetrating than the self-estimate. Secondly, even the student who knows his own
interests is unable to judge how his interests compare with those of other
persons.

The counselor must not, however, assume that interest tests are more
valid than expressed interests. For the older subject whose expressed and
measured interests differ, there is some evidence that his expressed interests
better predict what field he will enter (Wightwick, 1945). Where the two
disagree, the counselor will want to make sure that the expressed interest
is based on mature consideration, but he would be unwise to dismiss it as
wrong.

Prediction of Occupational Choice and Satisfaction. Interest scores
discriminate between men in various occupations; this in itself is partial evidence
that the scores are a sound basis for guidance. Both the Strong and the
Kuder tests have been studied sufficiently to verify that the majority of
persons successful in an occupation have corresponding interest scores. It will
be recalled that the SVIB is keyed so that 70 percent of successful adult
engineers earn A’s on the Engineer scale. Among students, 30 percent of those
who later make a career in engineering score A in college. Data for the
Kuder test are not so fully reported, though there is evidence that scientific
and computational interests are characteristic of those who persist in
engineering (Barnette, 1951).

While either test can point out occupations for which the person seems to
have appropriate interests, it is obvious that the usual scores do not pin
down the vocational choice narrowly. Indeed, subclasses within the same
profession differ a great deal in their interests, as Strong’s accountant study
illustrated. Dunnette (1957) tested four types of engineers employed in the
same company. High interests in each group are plotted in Figure 75. The
“pure research” group show interest in both scientific and technical fields,
whereas the “applied research” group score lower in the scientific (Group I)
occupations and have more interests in common with office and sales
personnel. The production and sales engineers are rather similar, both being
interested in sales and office tasks and not in technical areas. Dunnette
developed special keys for separating the engineers into the four types, and
found that he could correctly classify two-thirds of the engineers in a
cross-validation group. It is of interest to note that he tried two procedures:
weighting item responses in the usual Strong manner, and combining
Strong’s occupational keys in a regression formula. The two methods worked
equally well, but the weighting of scores was much simpler since the
occupational scores had previously been calculated. (See also Estes and Horn, 1939;
Strong and Tucker, 1952.)

FIG. 75. Interests shown by engineers performing different functions (Dunnette, 1957). Each
dot shows a Strong score in which the average score of these engineers was A or B+; a circle
shows an average of B. For names of specific occupations, refer to Figure 73.

Interest tests can discriminate men satisfied in a job from those who are
dissatisfied. Perry (1955) divided Navy yeomen (clerks) into satisfied and
dissatisfied groups, according to whether they said they would choose the
same service career if they could start over. On the Office Worker key of the
SVIB, the mean score of the satisfied yeomen was 48 while that of the
dissatisfied group was 21. (The s.d. being 33, the difference is highly
significant.) A guidance service which had given the Kuder inventory to
high-school seniors and adults asked them, a year or more later, what work they
were doing and how well they liked it. The investigators then classified each
person according to whether his tested interests were suitable for the job
he held. A similar judgment was made regarding his measured general
mental ability. As Figure 76 shows, interests do forecast satisfaction, and the
combination of interests and ability taken together is an excellent predictor.

Further evidence that interest differences predict future satisfaction is
found in Strong’s studies (1943, pp. 114 ff.) of men who change from one
field to another after leaving school. His follow-up study supports all the
following statements:

Men who remain in an occupation for ten years or more average higher
scores for that occupation than for any other.

Men continuing in an occupation have higher scores in that interest
than men who try the occupation and change.

Men who change from one occupation to another change to one in
which their interest scores were about as high as for the first choice.

The correlation of interest scores with professed satisfaction is low (about
.20) in Strong’s study and in other studies he cites. The principal reason
appears to be that among college graduates even those with low interest in
their work usually report satisfaction, presumably because prestige, working
conditions, and role in the community play a large part in job
satisfaction.
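
The size of the difference between satisfied and dissatisfied yeomen in Perry's study (means of 48 and 21, s.d. 33) can be expressed in standard-deviation units. The sketch below uses only those three reported figures; the normal distributions with a common s.d. are an assumption made for illustration, not something the study is stated to have verified.

```python
from statistics import NormalDist

mean_satisfied = 48.0
mean_dissatisfied = 21.0
sd = 33.0  # the single s.d. reported in the text, assumed to hold in both groups

# Difference between the group means in s.d. units.
d = (mean_satisfied - mean_dissatisfied) / sd

# Under the normality assumption: share of dissatisfied yeomen who score
# below the mean of the satisfied group.
below_satisfied_mean = NormalDist(mean_dissatisfied, sd).cdf(mean_satisfied)

# Share of satisfied yeomen who fall above a cutting score placed midway
# between the two means (by symmetry, the same share of dissatisfied
# yeomen fall below it).
cut = (mean_satisfied + mean_dissatisfied) / 2
correctly_placed = 1 - NormalDist(mean_satisfied, sd).cdf(cut)

print(round(d, 2))
print(round(below_satisfied_mean, 2))
print(round(correctly_placed, 2))
```

A difference of somewhat under one standard deviation can be highly significant in a large sample while still leaving the two distributions heavily overlapped, which is why a mean difference alone does not show how well individuals can be sorted.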

Strong’s most impressive data (1955) are those showing that college
interest scores predict what occupation the man will actually be engaged in
eighteen years later. Among men later employed in an occupation, five times
as many had A+ ratings in that occupation in college, three times as many
had A- ratings, and one-fifth as many had C ratings as among men
employed in other fields. In the Stanford sample an A in Engineer indicates
one chance in three of becoming an engineer, one in three of entering a
related occupation, and one in three of entering an occupation having little
resemblance to engineering. This indicates good validity, since a man may
have several A’s yet can enter only one field of work.

FIG. 76. Dependence of job satisfaction on suitability of interests (Lipsett and Wilson, 1954).

McArthur (1954; McArthur and Stevens, 1955) challenges too simple an
assumption that interest scores ought to predict what a person will become.
He argues that many forces other than interests determine what field a
person will enter. Particularly, the family of the well-to-do boy may dictate what
field he will enter, and provide the economic support to assure success.
When the presumably wealthier Harvard students who had come from
private schools were considered separately from those who were
public-school graduates, a marked difference appeared. The private-school group,
in general, entered an occupation corresponding to their claimed interests
and not to their measured interests. The public-school group (generally
upward-mobile middle-class boys) entered fields corresponding to measured
interests.

This evidence does not deny that the SVIB accurately measures interests
of upper-class boys. “It is not,” Darley says (Gee and Cowles, 1957, p. 26),
“that the Strong doesn’t ‘work,’ it is that you don’t need it when students’
occupational choices are so completely determined by the subculture from
which they come. All you need to do is to ask a boy in this particular private
prep school what he is going to be and you get the right answer since this is
totally predetermined by his entire environment. The Strong may truly
reflect another pattern of motivation which his subculture does not allow him
to use, and this is the tragedy of that subculture.”

For those entering professions or skilled jobs, interest measures are highly
meaningful. In the occupational world, however, a large number of jobs
are essentially routine and offer little possibility of self-fulfillment. As Darley
and Haganah (1955, pp. 8-9) point out:

Only at [professional, managerial, and skilled] levels do students
tend to say “that would be an interesting job.” We as adults also tend to
feel that the really “interesting” jobs are to be found only in the upper
categories and that many workers are doomed to tasks requiring
little training, repetitive and routine activities, and rather
undemanding or unchallenging work assignments. . . . In the various job
satisfaction and morale studies, a crude division of responses appears to be
related to the hierarchy of occupations. Respondents at lower
occupational levels stress as sources of satisfaction economic factors, security,
a chance to get ahead, a need for recognition as persons. Respondents
at upper economic levels define satisfaction in terms of “interesting
work.” For the former group, satisfaction derives from sources
external to the work.

There probably are no special patterns of interest characteristic of the
unskilled occupations. Strong developed keys only for responsible positions;
even his Group IV covers only skilled trades. When Clark (1950) tried to
develop interest keys for various jobs by Strong’s method, he found no
distinction among men in various unskilled trades. He did, however, succeed
in differentiating skilled trades from each other. The few Kuder data
available on lower-level groups support this view. Average profiles for such
groups as filling-station attendants, department store help, and painters are
similar, being much flatter than the profiles for professional and skilled
groups.

This result makes sense when we think about job duties. The mounter,
whose aptitudes we discussed earlier, spends hour upon hour fastening
wires together; wherein lies his vocational satisfaction? If he is to be satisfied,
his pleasure must come not from the work itself but from companionship,
good working conditions, and freedom from responsibility. The very essence
of interest is a changing environment, presenting new situations to be
interpreted and dealt with. Thus, while some people can be content in a
routine job, it probably cannot command active interest.

19. Darley and Haganah estimate that the Strong vocational keys cover only 20 
percent of the male working population. Does this constitute a serious criticism 
of the Strong scale? 

20. Production engineers average only B- and sales engineers average C on the
Strong Engineer scale. What does this imply regarding the use of Strong scores
in guidance?

21. Assuming a normal distribution of Office Worker scores, and the same s.d. in
both groups, sketch distributions for satisfied and dissatisfied yeomen in Perry’s
study. How well does the Strong distinguish these groups?

22. Expressed interests of private-school graduates predict what they will be doing
more accurately than their inventoried interests. Does this mean that the SVIB
is an invalid indicator of their interests?

23. How is the interpretation of Strong's follow-up studies influenced by the fact
that interest scores were discussed with the subjects during college?

24. In addition to the original Physician key, Strong and Tucker developed two
sets of keys for medical specialties such as surgery. One key is based on items
which differentiate surgeons from men in general, the other on items which
differentiate surgeons from a mixed group of physicians. Which of the three
keys could best be used

a. for guidance of a college freshman?

b. for advising medical-school seniors about their careers?

c. for assigning Army medical officers to duty and advanced training?

25. Among veterans who planned to enter engineering, a success and failure
group were distinguished by Barnette, according to whether they continued in
engineering training. How do you explain the fact that the Kuder Mechanical
score, though high in both groups, had no relation to continuance whereas
Computational had a very marked relation?

26. Do you agree with the following statement?

“Insofar as stated choice of occupation by groups of individuals (high school
girls) may be considered a true criterion of interest, the lack of relationship
between statement of occupational choice and interest scores . . . may be
considered evidence of the lack of validity of the interest inventories.”

Prediction of Occupational Success. Only a few studies have examined
whether the usual interest scores predict job performance. The most
adequate studies are Strong’s investigations of insurance agents (1943, pp.
486-500). The Strong scores predict either ratings or records of business
produced, with correlations of about .40. Men with A scores in sales interest
wrote, on the average, $169,000 per year of new policies, whereas C men
averaged only $62,000. A few C men, however, were unmistakably
successful.

E. L. Kelly and D. W. Fiske (1951) tested students entering training for
clinical psychology with predictors of all types: ability measures, personality
questionnaires, performance tests of personality, and interview ratings. Four
years later they collected such criteria as grades, scores on performance
tests, and ratings by training supervisors. Particular interest attaches to the
ratings on Overall Clinical Competence and Research Competence. In a
study with hundreds of predictive scores and a dozen criteria, no single
coefficient is dependable, but several findings regarding interest tests
emerged. The Kuder proved to have rather little predictive value: out of
117 correlations, only 16 (13 percent) reached .20. For the SVIB, which has
more scales, a total of 677 coefficients were determined, and 140 of these
(21 percent) reached .20. Except for the Miller Analogies Test, a measure
of verbal ability, no test yielded better predictions than the Strong. Most of
the larger correlations for the overall clinical and research criteria are
summarized in Table 57. The correlations may well have been reduced by
restricted range in the group and by unreliability of the ratings. Verbal
interests and creative-scientific interests appear to be associated with success in
clinical psychology, while interest in business activities is associated with
low ratings. There are some differences between those high in research and
those high as clinicians, the former being stronger in scientific interests and
distinctly lower in business interests. Strong’s original Psychologist key,
prepared in the 1920’s, emphasized the interests of research psychologists and
here correlates quite high with rated research competence. The Kriedt
(1949) keys for various types of psychologists show that among interests
characteristic of psychologists there are numerous patterns and each pattern
is relevant to a different type of success.


TABLE 57. Correlations of SVIB Scales with Ratings of Trainees in
Clinical Psychology

                                              Correlations with Ratings on
                                              Overall Clinical   Research
SVIB scales                                   Competence         Competence

Group I: Artist, Architect, Physician          .10 to  .22        .22 to  .34
Group II: Mathematician, Physicist,
    Chemist                                   -.04 to +.09        .27 to  .36
Group III: Production Manager                 -.25               -.10
Group V: Personnel Director, Social
    Science Teacher                           -.03 to +.06       -.15 to -.01
Group VIII: Office Man, Purchasing
    Agent, Banker                             -.24 to -.08       -.30 to -.25
Group IX: Sales Manager, Life
    Insurance Salesman                         .02 to  .07       -.24 to -.21
Group X: Advertising Man, Lawyer,
    Author-Journalist                          .24 to  .35        .15 to  .27

Psychologist keys
    Original Strong Psychologist               .18                .43
    Kriedt Psychologist                        .20                .38
    Kriedt Clinical                            .26                .01
    Kriedt Experimental                       -.08                .16
    Kriedt Guidance                            .00               -.22
    Kriedt Industrial                         -.21               -.22

Source: E. L. Kelly and D. W. Fiske, 1951, pp. 150-155.


Interest inventories have shown negligible value for predicting success in
vocational training. The Air Force correlated scores on various interest
categories with grades in thirteen training schools. Almost all correlations were
below .20 (Brokaw, 1956).

Many students confuse interests with aptitudes and misinterpret the
interest test as a measure of what they can do best. Interests obviously tell
nothing about abilities; in general, the correlations between interests and
corresponding abilities (e.g., between Kuder Clerical and DAT Clerical)
are close to zero. A high interest score should be interpreted as indicating
that if a person survives training and enters the occupation, he is likely to
enjoy his work. Though interests imply motivation, their influence on
success is rather small. The Frederiksen-Melville study (p. 345) explains this in
part. They found that grades of “compulsive” students depend only on
abilities; such students make an effort whether interested or not, and their
interests have no predictive value. Among noncompulsive students,
however, interests predict achievement with validity .36 to .55. While it is
dangerous to generalize from this one study, it seems reasonable to conclude that a
person with interests and abilities suitable for an occupation can and will do
well in it, a person with suitable abilities but unsuitable interests can do well
but may not, and a person with suitable interests and low aptitude will do
badly (cf. Fig. 76).

Efficient prediction of success cannot generally be expected for interest
scores based on differences between men-in-the-occupation and
men-in-general. It is necessary to establish differences between
good-men-in-the-occupation and poor-men-in-the-occupation. The characteristics
differentiating good from poor veterinarians might have no resemblance to the pattern
distinguishing successful veterinarians from the average man. Only a few
studies have developed keys distinguishing good from poor men. One study
of salesmen and servicemen for office equipment has demonstrated that this
method of treating interest inventories may have substantial predictive
power (Ryan and Johnson, 1955).

27. Assume that certain interest scores of veterinarians are distributed in the
manner described below. What advice would a discriminant key of the Strong
type lead to? What advice would be given if expectation of success within the
field were considered?

a. In Outdoor interests, veterinarians are higher than the average man, and
their success is positively correlated with the interest score.

b. In Persuasive interests, veterinarians have the same average as men in
general. Persuasive interests are positively correlated with success in the field.

c. In Social Service interests, veterinarians tend to be below the average for
all men, and the correlation between interests and success is positive.

28. Do you agree with this opinion?

“Various criteria have been suggested in connection with vocational
counseling. . . . Success is often employed as a criterion. It is more appropriate
in connection with aptitude tests than with interest tests. But it is doubtful if it
is as good as it seems. Fifty per cent of people must always be less successful
than the average. Counseling evaluated on such a basis must always appear
rather ineffective” (Strong, 1955, p. 11).

Prediction of Academic Criteria 

There is no overall interest pattern significant of superior academic
performance. Among several groups studied at Yale, overachievers were
consistently a bit higher in scientific, uplift, and verbal scores on the Strong, and
lower on business scores, but the overlap of the overachievers and
underachievers was very great (R. M. Rust and F. J. Ryan, 1954). Other studies
such as that of Kelly and Fiske confirm this result. Correlations of interests
with grades in specific fields or courses are generally below .30, which
implies that interest tests add only a small amount to formulas for predicting
grades. Interest scores may predict persistence even if they do not predict
grades. In one study of dental students 92 percent of those with A and B+
scores on the Strong Dentist key graduated, compared with 67 percent of
B’s and 25 percent of C’s (Strong, 1943, p. 524).
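
The statement that a correlation below .30 adds only a small amount to a
prediction formula can be checked with the standard formula for the multiple
correlation of a criterion with two predictors. The sketch below is only an
illustration: the aptitude validity of .50 and the intercorrelation of .20 are
assumed values chosen for the example, not figures from any study cited here,
and .30 is taken as the upper limit mentioned in the text.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)


# Hypothetical values, assumed for illustration only.
aptitude_validity = 0.50   # aptitude test vs. grades (assumed)
interest_validity = 0.30   # interest score vs. grades (upper limit in the text)
intercorrelation = 0.20    # aptitude vs. interest (assumed)

combined = multiple_r(aptitude_validity, interest_validity, intercorrelation)
gain = combined - aptitude_validity

print(f"aptitude alone:         {aptitude_validity:.2f}")
print(f"aptitude plus interest: {combined:.2f}")
print(f"gain in correlation:    {gain:.2f}")
```

Under these assumptions the combined correlation is about .54, a gain of
roughly .04 over the aptitude test alone. A lower interest validity, or a
higher intercorrelation between the two predictors, would shrink the gain
further.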

Segel (1934) found definite correspondence between interests and
differences in achievement between courses. The correlation of Strong
Engineer interests and mathematics-marks-minus-history-marks was .61. This is a
finding of great potential importance in classification and guidance, but
unfortunately other investigators have not attempted to confirm and extend it.

Specially constructed keys have had some success in predicting grades.
Various “studiousness” keys have been made for the Strong blank by scoring
on items which distinguish good and poor achievers. Mosier (1937) found
that studiousness scores and grades correlated .47 for students in liberal arts,
.24 for engineering students, and only .05 for business administration majors.
Though such a score can improve prediction by some amount, its validity
must be established in each new situation. Hundreds of studies have tried
to use interest scores to predict average marks, but no technique has been
satisfactory enough for practical use. The only approach that can be viewed
as promising is the prediction of specific marks by means of specific interest
scores chosen on theoretical grounds, as in the studies of Segel and
Frederiksen and Melville.
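
The general logic of a specially constructed key, scoring on the items that
distinguish good from poor achievers, can be sketched as follows. This is a
minimal illustration and not the procedure of Mosier or of any published key:
the function names, the unit weights, the cutoff of .20, and the response data
are all invented for the example.

```python
def build_unit_weight_key(high_group, low_group, min_diff=0.20):
    """Select items whose endorsement rates differ between two criterion groups.

    Each group is a list of response records; each record is a list of 0/1
    answers (1 = "like"). Returns {item_index: +1 or -1}.
    """
    n_items = len(high_group[0])
    key = {}
    for item in range(n_items):
        p_high = sum(rec[item] for rec in high_group) / len(high_group)
        p_low = sum(rec[item] for rec in low_group) / len(low_group)
        diff = p_high - p_low
        # Keep only items that separate the groups by at least min_diff.
        if abs(diff) >= min_diff:
            key[item] = 1 if diff > 0 else -1
    return key


def score(record, key):
    """Unit-weight score: add or subtract 1 for each keyed item marked 'like'."""
    return sum(weight * record[item] for item, weight in key.items())


# Invented responses to three items, for illustration only.
good_achievers = [[1, 0, 1], [1, 0, 0], [1, 1, 1]]
poor_achievers = [[0, 1, 1], [0, 1, 0], [0, 1, 1]]

key = build_unit_weight_key(good_achievers, poor_achievers)
print(key)
print(score([1, 0, 1], key))
```

A key built this way capitalizes on whatever differences happen to appear in
the sample used to construct it. Its correlation with grades should therefore
be computed on a fresh group, which is one reason validity must be
established in each new situation.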


INTERESTS AND PERSONALITY 

In adjustment inventories the subject frequently conceals his attitudes and
feelings. The person is usually pleased with and proud of his interests,
however. Especially where it is understood that the tests will be used to provide
the person with activities which will interest him, there is likelihood of
honest self-report. Interests give clues regarding adjustment and personality.
Highly intellectual interests or concentration in some field where one has
developed unusual competence may be an attempt to withdraw from fields
where one cannot be sure of superiority.

Some persons with conflicts arising from self-criticism find satisfaction in
activities which others find monotonous. Mathematical and clerical work, for
example, appeals to some workers who need to be sure that what they do is
right. Having added a column of figures and checked themselves, they can
feel an assurance they could never have after writing a story or planning a
party, or in some other activity where “rightness” is less objective. There
are others who can be satisfied only when imposing their own individuality
upon their work. Such people frequently dislike routine or stereotyped
activities but respond eagerly to artistic tasks where originality is essential.

This suggests going behind the interest test score in an attempt to infer
the type of personality consistent with the interests. Because it relies on the
insight of the interpreter, any such attempt is open to error. A good many
empirical studies (Darley and Hagenah, 1955, pp. 103-133) have found
modest relations of interests to personality tests, but clinicians and
counselors have not found these studies very illuminating because they reveal
little about the nature of the stresses within each personality. Clinicians
therefore fall back on cumulated experience with individual cases. Rarely
is such experience collected and systematized; the most complete research of
this character is Anne Roe’s work on eminent scientists (1952, 1957). Her
findings, based on interviews and projective tests, deserve close study by
anyone concerned with vocational counseling.

The implications of personality interpretations are well illustrated by these
comments on medical students (E. L. Kelly in Gee and Cowles, 1957,
pp. 185-196):

    As a group, the medical students reveal remarkably little interest in
    the welfare of human beings. For example, one of the sharpest
    distinctions I can find between a group of physicians at Michigan and a group
    of clinical psychologists whom we have been studying is [the physicians’
    higher] Farmer score on the Strong. The Farmer key . . . is
    based on the modal interest pattern of highly successful graduates of
    scientific agricultural schools. . . . Such persons are not scientific in the
    sense that they want to discover new truths; their concern is rather the
    application of science toward the goal of increasing production.

    Another characteristic of medical students is reflected by their
    relatively high scores on the Aviator scale. . . . The one thing they
    [various kinds of pilots] have in common is maleness and a lack of interest in
    anything cultural.

    Our data suggest that if you want to select the kind of lad who is
    going to be interested in public health, general practice, and so on, you
    should pick the person with a high Strong score on the Carpenter key.
    This is a person who has a relatively low upward mobile ambition in our
    society.

Kelly’s data, from incomplete research in one medical school, illustrate
that interest scores shed some light on the role the person is likely to
perform within his profession. Among other criteria, sociometric ratings were
obtained from the student’s peers, indicating (1) his social relationships,
likelihood of becoming a hospital administrator, and personal acceptability as a
colleague, and (2) his likelihood of entering some public service role such as
medical-school teaching, and willingness to sacrifice high income. A
miscellaneous group of Strong scores was used as predictors of these criteria. For
Kelly’s 112 cases, the Strong Mortician key predicted the social relationship
rating (r = .30), and Mathematician and Chemist had negative correlations
of -.29 with the rating. The highest correlations with the rating on service
orientation were Carpenter, .44, and Sales Manager, -.42. Other scales
showing positive correlations between .30 and .39 were Industrial Arts
Teacher, Math-Science Teacher, Physicist, and Dentist; negative
correlations were shown by Advertising Man, CPA, and the Strong keys for sales
occupations. There are many pathways to success in medicine or any other
profession; vocational self-understanding is not complete when the person is
fitted into a broad occupational category.

The most comprehensive information now available on personality
correlates of interest scores comes from a study at the University of California
(Block and Petersen, 1955; see Darley and Hagenah, 1955, pp. 128-129).
One hundred Air Force officers were assessed by a great variety of
techniques, including tests and interviews. Clinical psychologists recorded their
impressions by rating each man on an adjective checklist. For each Strong
scale, a tabulation was made of the personality characteristics on which men
high on that category differed from the remaining men. It was found, for
example, that the following descriptions tended to fit those with high
Mathematician scores: concerned with philosophical problems, introspective,
lacking in social poise, lacking confidence in own ability, self-abasing, reacts
poorly to stress, sympathetic, and not ostentatious, aggressive, or socially
ascendant.

It is important to warn against interpreting particular interest patterns as
indicating “good” or “bad” personalities. Perhaps the reader has already been
inclined to judge the mathematically oriented officers described above as
being maladjusted. Any such value judgment can be made only on the basis of
some personality theory which is open to challenge. Most contemporary
psychological writing appears to assume that the ideal personality is confident,
interested in social contacts, and effective as a leader. Roe (1952), however,
points out that many distinguished, highly effective, and apparently
contented physical and biological scientists are not at all socially oriented. They
care little for making friendships or for earning the good opinions of others.
Eminent and effective psychologists (including laboratory experimenters),
on the other hand, typically are concerned with having good relationships
with others. Roe finds that both groups had had difficulties with social
relationships at some time in their preadult development, and believes that each
group chose a different method of adjusting successfully to these difficulties.
The physical scientists became absorbed in tasks not involving other persons,
while the psychologists made other persons their professional concern. This
leads Roe to question whether psychologists, merely because their
personalities have now crystallized about an active relationship with others, build
such a relationship into the definition of “good adjustment” they apply to
others. Quite possibly, says Roe (1953), the psychologists are critical of effective
and healthy patterns of adjustment which do not coincide with their own.
Conversely, if physical scientists were to define the healthy personality after
studying all the data available to psychologists, their ideal might place little
emphasis on warm friendships and ability to lead, and a great deal of
emphasis on responsibility, freedom from suggestibility, and independence of
group opinion. This argument is supported by another study of ratings of
graduate students at the University of California. To the clinical
psychologists, “soundness” of personality depends strongly upon warmth in
interpersonal relations, and eccentricity or deviation from the norm is regarded with
suspicion. When faculty members rate the same students, however,
“soundness” is judged almost entirely by the student’s effectiveness in getting his
work done (Barron, 1954).

USE OF INTEREST TESTS IN COUNSELING 

Interest inventories are rarely used for selection or for administrative
decisions about classification, even where suitable scoring formulas permit valid
prediction. Historically, interest tests have always been a method for helping
the individual attain satisfaction for himself rather than a method for
satisfying institutions. As a result, the interest inventory is used almost entirely in
academic and vocational counseling.

One may conceive such counseling as intended to arrive at a decision
(i.e., selecting a definite goal and working out a training and career plan),
or one may conceive the counseling as intended to promote the client’s
understanding of himself. More and more, counselors are shifting to the second
point of view. As we have pointed out earlier, vocational development
necessarily involves new choices as new facts become available, as the individual
matures, or as his social circumstances and opportunities change. The
student and counselor in high school may set down a definite plan to study
certain subjects, to enroll in a certain college curriculum, to complete training
in a certain professional school, and to find an opportunity to enter a certain
type of practice. This plan has an almost negligible probability of being
carried out. Somewhere along the line instructors will open new vistas to the
student or arouse new interests. Somewhere along the line concrete
experience will show him that he does not enjoy some aspect of the work and will
reveal an unsuspected talent in another direction. Counseling should
realistically assume that any plan is a road with many branches. In a good plan
most of the branches are conceivably appropriate for the client to follow, and
he is able to reach any of the goals which at present seem most appropriate.

If the goal in counseling is not to be definite planning, save with regard to
immediate decisions, the goal must be to equip the student to make future
decisions as choice-points are reached. The aim in counseling should be to
give the student a more sophisticated view of the world of work, of the
choices open to him, and of his own range of potentialities for achievement
and satisfaction.

Interest inventories are peculiarly well adapted to vocational counseling.
The student expects his interests to be considered, and he is not threatened
by the questionnaire as he might be by personality or ability tests. The
interpretation, when given, carries considerable force, because the student can
see that he is looking at himself in a mirror, that he is only receiving an
analysis of what he himself has said. No psychological mysteries becloud the
interest test as they do tests involving more esoteric constructs of personality
and aptitude. From the counselor’s point of view also, the interest inventory
is less fraught with emotional significance. The counselor hesitates to tell a
student his aptitude and personality test scores unless there is ample
evidence that he can accept and comprehend the findings. Scores on interest
tests can be discussed freely, however; while they may require the student
to examine discrepancies within his self-concept, they rarely threaten his
esteem.

For the counselor or high-school instructor who wishes to encourage
thinking about future plans, the interest inventory is a helpful device. It can be
given to entire classes or entire student bodies. Students are quite willing to
reveal their interests and are eager to have a report of scores. Although there
is some risk of misunderstanding, interpretation of profiles can be carried out
in group discussions rather than in individual counseling (Layton, 1958,
pp. 32 ff.). Such a process, leading each student to list vocational possibilities
suggested for him by the test, is an excellent preliminary either to further
group study of careers or to individual counseling.

The interest inventory also assists counselors in dealing with many other
student problems. A promise to interpret interest scores is an excellent,
nonthreatening gambit to entice the student into the counselor’s office. In the
course of the discussion of vocations he will necessarily talk about his family,
his social relations, and his academic difficulties, and so may touch upon
problems for which the counselor can provide help ranging from a
diagnostic reading test to psychotherapy. The conference opens a natural
opportunity for the student to express his desire for such help, a desire which he
might otherwise never have acknowledged even to himself.

In view of the aims described above, it is most unwise to concentrate the
interview upon an analysis of scores for specific occupations. This tactic gives
the student far too narrow a description of himself and leaves too many
things out of consideration. It is absolutely essential that the student should
go beneath occupational labels and stereotypes, that he should understand
the variety of roles different members of the same occupation play, that he
should understand the differences between demands of the training
program and demands of the occupation, and that he should recognize the
shifting nature of occupations. He must consider his abilities and academic
prospects, the pressures from his family, his motivations and values, his financial
resources, and the probability that his present interests may shift.

Darley and Hagenah (1955, p. 195), speaking from this point of view,
sharply criticize some common practices in vocational counseling. They take
as an example the student with peak interests in the social service group.

    At some point in the counseling interview series, the counselor can
    make this bald statement: “You have the same kind of interests as
    successful personnel managers or Y.M.C.A. secretaries or school
    superintendents.” With minor modifications, this is probably the standard
    approach to interpretation. It is also the least effective approach and the
    one most likely to lead the student and counselor into ever deeper
    morasses of interpretive difficulties.

They give eight reasons for condemning this approach, most of which we
have touched on already. Most specifically, such an approach immediately
causes the student to think in terms of occupational stereotypes, instead of
trying to see what interests of his match activities common in the jobs
mentioned. Moreover, since he may attach a negative connotation to, let us say,
“school superintendent” if he sees himself as a business executive in an
all-male world, the student may find it necessary to resist the test interpretation.

Instead of a narrowly occupational interpretation, counselors should help
the student identify the groups of activities in which he has expressed
interests. The Kuder scores lead directly to this type of interpretation, and the
occupational groups of the Strong are similarly a good starting point for
discussion. A high score in literary interests, for example, can be amplified by
questioning which will clarify whether this is an interest in reading, in
writing, or in speaking, whether it is an interest in face-to-face verbal activities
or in isolative verbal activity, and whether it is accompanied by any evidence
of talent in expression. The discussion will ultimately come around to specific
vocations used as examples of ways in which the expressed interests might be
satisfied. Such illustrative vocations can be selected by the counselor in the
light of the student’s claimed interests, his probable ultimate level of
education, and his abilities. They may or may not correspond to the limited
number of occupations for which keys exist.

It is particularly necessary to reconcile differences between claimed
interests and measured interests in a way that is emotionally acceptable to the
student. To have told Mary Thomas, “You don’t really want to work in child
development, you want to be a secretary,” would have precipitated an
emotional conflict. No one can abandon a long-standing self-concept easily. An
authority who bluntly contradicts firm beliefs invites the counselee to reject
him as an authority. In Mary’s case, it might have been better to inquire as
to the reasons for her choice of child development, to ask her to envision the
activities she might be engaged in ten years hence, and to compare those
with the activities rated high in the interest blank. The fact that the
inventory contains only her own ratings brings her face to face with her
self-contradictions. The psychologist is no longer the “authority”; he is merely
holding the mirror for her.

We may compare the three types of interest inventories in terms of their
suitability for counseling purposes. The Strong blank is undoubtedly the
most highly developed and best understood of the inventories; indeed, it
ranks very near the top among psychological tests of all types.¹ The specific
occupational keys with complex weights are rather inconvenient to score by
hand and, indeed, may be no more valid than keys with unit weights. Those
who prefer the Strong over its competitors will not find the cost and delay of
electronic scoring a severe handicap. The scoring charge is currently about
$1 per person. The great number of keys makes interpretation both rich and
complex. Speaking generally, this militates against its use by high-school
teachers and relatively untrained counselors, and against its use in mass
counseling programs. But its length and complexity, together with its
research foundation, make the Strong the preferred instrument of most highly
trained counselors and psychologists dealing with college students. The
blank is relatively unsatisfactory for clients entering occupations below the
professional-managerial level.

¹ The amount of effort involved in painstaking test construction is indicated by Strong’s
report that over $45,000 was spent, and blanks from over 23,000 people were obtained,
in the course of his research.

The Kuder blank has a much simpler format. Even ninth-graders can take
the test in groups, score their own tests, and plot their own profiles. There is
evident danger if this invites teachers to leave interpretation to the students,
but this is not a necessary fault of the inventory. The scores lend themselves
directly to interpretation in terms of patterns of activity, with vocational
interpretation secondary. For this reason scores of the Kuder type seem
slightly preferable to those of the Strong type, especially in the hands of
counselors with limited training. It is of interest that Darley, a leading
advocate of the Strong test, urges that it be interpreted in terms of interest
categories such as technical, social service, business detail, and verbal-linguistic,
rather than in terms of the occupational scores per se. While the Strong can
be so treated, the Kuder is designed for just this use. It appears more suitable
than the Strong for girls, and for students headed for lower-level
occupations. Both the Strong and the Kuder inventories are long and tedious, which
makes them somewhat unsuited for application to large groups. Canfield
(1953) has shown that Kuder profiles are not greatly altered if only the odd
pages are administered, thus cutting testing time in half. He has prepared
norm tables for this short form. (See also Clark and Gee, 1954.)

Inventories such as the Lee-Thorpe, developed on a logical or
content-sampling basis, are much harder to evaluate. The items are more directly
descriptive of vocations than are those of the Strong and Kuder, and are
therefore more likely to invite responses on the basis of stereotypes. A more
serious difficulty is that one cannot say whether the category scores represent
suitable constructs for describing individuals. The heterogeneous mixture
of activities called “mechanical” by Lee-Thorpe defines a less clear interest
pattern than the mechanical items of Kuder, whose intercorrelations have
been established empirically. The Lee-Thorpe inventory would be a more
useful instrument if considerable empirical work were done to revise the
groupings and to provide a background of facts with which to interpret
scores. In the absence of such facts, scores on Lee-Thorpe categories appear
to deserve little emphasis. The items, covering a wide range of occupations,
may be regarded by counselors as a checklist or pencil-paper interview.
Innumerable leads for interviewing will come out of consideration of the
separate items.

A final consideration in choosing between inventories is a statistical
comparison of reliabilities, intercorrelations, and relations with criteria.
Unfortunately, the available information is spotty at best, since only a few isolated
studies have administered two or more inventories to the same sample. The
Strong and the Kuder are about equally reliable, with the Strong having a
slight advantage. The “corresponding” keys sometimes agree very closely but
at other times seem to have different psychological meanings. Differences of
this kind indicate how important it is for the counselor to become thoroughly
familiar with the particular test he is using and with the research indicating
the meanings of scores.


29. The Thurstone Interest Schedule consists of a set of paired comparisons of
occupational titles such as Engineer vs. Accountant. Titles are assigned to areas
on a logical basis. The choices made in a given area are counted, yielding
scores in ten areas such as Physical Science, Business, Linguistic, and
Humanitarian. The scale requires about ten minutes. Discuss the advantages and
disadvantages of such an inventory for counseling.

30. The Thurstone profile is expressed in terms of percentage of choices in each
area. No norms are used, interpretation being based on the shape of the
raw-score profile. Is this advantageous or disadvantageous?

31. Compare the extent to which the Strong and the Kuder are influenced by
response style (p. 372).

32. An interest test is to be used in helping junior-college freshmen make
vocational plans. Among the training programs offered are those for
photoengraver, dietitian, and others not directly represented in the Strong and Kuder
keys. How could each test be extended to assist students in judging whether
they would like these fields? Which test seems to be more adaptable?

33. What errors are likely to occur when ninth-grade students score and interpret
their own Kuder profiles, in the course of several days of class discussion? How
can the teacher reduce such risks?
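
The raw-score profile described in Exercises 29 and 30 can be pictured with a
small computation. The sketch below is an assumption-laden illustration, not
the published scoring procedure: it supposes only that each paired comparison
yields one chosen title, that each title belongs to one area, and that the
profile is the percentage of all choices falling in each area. The ten
choices shown are invented.

```python
from collections import Counter


def percentage_profile(chosen_areas):
    """Percentage of all choices that fall in each interest area."""
    counts = Counter(chosen_areas)
    total = sum(counts.values())
    return {area: 100.0 * n / total for area, n in counts.items()}


# Invented record: the area of the title preferred in each of ten pairs.
choices = ["Physical Science"] * 6 + ["Business"] * 3 + ["Linguistic"] * 1

profile = percentage_profile(choices)
for area, pct in profile.items():
    print(f"{area:18s} {pct:5.1f}%")
```

Because the percentages must total 100, a high standing in one area forces a
lower standing elsewhere. The profile therefore shows the relative strength of
interests within one person and says nothing about how he compares with other
people, a point worth keeping in mind when answering Exercise 30.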


PROSPECTIVE DEVELOPMENTS 

Interest testing and test interpretation have changed markedly since Strong’s
test was first published, and there is much reason to expect continued rapid
development. The initial investigations were blunt empirical comparisons of
occupational groups on items selected almost at random. There was no
theory as to the nature of interests or as to the types of interests most deserving
consideration, there was no theory about the structure of occupations and
careers, and the interpretations placed on test scores were entirely
pragmatic and unpsychological. Kuder’s approach and Strong’s factor analyses
led to the beginning of a theory of interests. Guilford and his associates, in a
tentative but comprehensive study (1954), found over twenty interest
factors, most of which seem to reflect general personality styles rather than
vocational orientations. Guilford recognized several familiar factors such as
mechanical, scientific, and social-welfare, but he added adventure vs. security,
aesthetic appreciation, cultural conformity or orderliness, need for diversion,
aggression, and many other interest dimensions. Longitudinal studies and
case histories have also begun to present a clearer picture of the significance
of interests. The original crude empiricism has declined in importance.

The next development that may be forecast is the systematic
interpretation of interests in terms of more fundamental personality constructs. The
work of Roe, Tyler, Kelly, Guilford, and the California investigators all
represents preliminary steps in this direction. Whereas interests were once viewed
almost as a product of chance conditionings, today it is thought that interests
are an expression of deeply rooted needs and adjustment patterns. Interest
tests are perhaps superior to many other techniques for assessing personality
because of their diverse content and their acceptability to the subject. But
considerable research must be done to place such interpretations on a sound
footing.

Every current writer on interests and on vocational counseling stresses
the great need for a theory of interest development and for a theory of
occupational adjustment to replace the present piecemeal collections of facts.
Much current research is intended to produce at least the beginnings of such
a theory, and as it emerges this theory will no doubt have radical effects
upon test interpretation. It will also reopen questions as to the appropriate
age to begin the measurement of interests, and the appropriate items to be
used.

LISTING OF INTEREST INVENTORIES 

Among the interest inventories currently in use are the following: 

• Kuder Preference Record, Occupational, Form D; G. Frederic Kuder;
Science Research Associates, 1956. A collection of 100 forced-choice items
drawn from the Kuder Vocational and Personal inventories. Intended for
institutions which wish to develop keys to place the most suitable persons in
particular jobs. For use in guidance, Kuder is developing and releasing keys
for various occupations. The 1957 manuals discuss keys for 22 occupations,
including those of electrical engineer, farmer, minister, etc. The information
available to date on discrimination between occupations is encouraging, but
until longitudinal studies and correlations with job satisfaction are
available, it is not possible to judge whether this inventory can become as
serviceable as the much longer Strong inventory.

• Kuder Preference Record, Vocational, Form C; G. Frederic Kuder;
Science Research Associates, 1939, 1951. For high-school students and adults.
A descriptive blank yielding ten scores showing the person’s percentile
standing in various interest categories. (See pp. 412 ff.)

• Guilford-Shneidman-Zimmerman Interest Survey; J. P. Guilford and
others; Sheridan Supply Company, 1948. For high-school students and
adults. An inventory based on factor analysis which identifies nine categories,
each of which has two subscores (e.g., aesthetic appreciation vs. expression).
Guilford’s later work suggests revision and extension of the categories. The
instrument is primarily suitable for research on interest development rather
than guidance in its present stage of development.

• Minnesota Vocational Interest Inventory; Kenneth E. Clark;
unpublished. For high-school students and adults. Can be expected, when
published, to fill an important place in counseling and classification.
Forced-choice triads are scored empirically to indicate how closely an individual’s
interests resemble those of men in various trades such as bakers, plasterers,
retail sales clerks, and truck drivers. The inventory thus covers a portion of
the occupational range for which the SVIB is inadequate.

• Occupational Interest Inventory; Edwin B. Lee and Louis P. Thorpe;
California Test Bureau, 1943, 1956. For Grades 7 upward. Yields scores for
six fields (personal-social, arts, business, etc.) and another set of scores for
verbal, manipulative, and computational interests. (See pp. 415, 435.)

• Strong Vocational Interest Blank for Men; E. K. Strong, Jr.; Consulting
Psychologists Press, 1927, 1951, with supplementary research reports. For
high-school students and adults. The outstanding example of an empirically
scored interest inventory. Keys for 47 occupations, plus group factors. (See
pp. 406 ff.)

• Strong Vocational Interest Blank for Women; E. K. Strong, Jr.;
Consulting Psychologists Press, 1947, 1951. For high-school students and adults.
Scores for 27 occupations. This instrument has not shown satisfactory
validities and is rarely used. In counseling women who plan to enter occupations
for which the men’s blank is scored, it is preferable to use the men’s blank.

• Vocational Interest Analyses; Edward C. Roeber and Gerald G.
Prideaux; California Test Bureau, 1951. Grades 9 and up. To be used as a
second step, following rough mapping of interests by the Lee-Thorpe
inventory. This instrument has six sections of 120 items each, corresponding to the
sections of the Lee-Thorpe. The counselor administers those sections
corresponding to high scores on the first test to obtain a more detailed analysis of
interests within the area. As no evidence of validity is available, the
inventory should be regarded as a written interview rather than as a scored test.


Suggested Readings 

Callis, Robert, Polmantier, Paul C., & Roeber, Edward C. The case of Bill Davis.
A casebook of counseling. New York: Appleton-Century-Crofts, 1955.
Pp. 77-103.

In transcribed interview notes a senior engineering student who expresses his
lack of interest in engineering goes over his Strong and Kuder profiles with a
counselor.

Darley, John G., & Hagenah, Theda. The Strong Vocational Interest Blank in
individual cases. Vocational interest measurement: theory and practice.
Minneapolis: University of Minnesota Press, 1955. Pp. 194-263.

The authors explain how to proceed from the profile showing primary interest
areas to a discussion of specific occupational choices. Ten cases are described,
showing how the use of SVIB information varies according to the client’s
aptitudes, maturity of self-concept, and background influences. Of particular
interest is a history (Karl Brooks) showing development from age 14 to 25.

Kuder, G. Frederic. Research methods for development of an occupational key.
Research handbook for the Kuder Occupational Preference Record Occupational.
Chicago: Science Research Associates, 1957. Pp. 27-38.

This is a brief account of the procedures used in developing and testing the
efficiency of a key to distinguish persons of one type from men in general.

Strong, Edward K., Jr. Interpretation of interest profiles. Vocational interests of
men and women. Stanford: Stanford University Press, 1943. Pp. 412-456.

Strong presents data on typical patterns among college students and suggests
how the relevant information can best be conveyed to the student seeking
guidance.


General Problems in Personality Measurement


PERSONALITY, attitude, and interest measures were introduced in Chapter 2
as measures of “typical behavior,” and thus distinguished from the ability
tests, which measure maximum performance. In assessing typical behavior,
the investigator wants to know what the person normally does rather
than what he can do under exceptional motivation. In this chapter, we shall
examine the notion of “typical behavior” and compare various ways of gathering
and interpreting information about it. Before proceeding to this discussion,
it would be wise to reread the introduction to procedures used to investigate
typical behavior given on pages 31 to 34.


TYPES OF DATA 

Observations in Representative Situations 

The logical way to determine typical behavior would be to observe the individual
repeatedly in situations likely to reveal the aspect of personality in
which we are interested. To study interests, one would observe what the
person does during his leisure. To evaluate a businessman’s generosity, one
would observe his responses to charitable appeals, his tipping, and his dealings
with employees.

The first requirement is a sufficient number of suitable observations. No
one act can be taken as typical, since it is influenced by mood, immediately
preceding experience, details of the surroundings, and other factors. There
are cycles and trends in behavior. If a subject appears quarrelsome on several
occasions, quarrelsomeness seems typical for him. Perhaps, however, he
is in a continuing state of irritability due to some worry, and some months
earlier or later he would appear well adjusted. It could be argued that we
took too small a sample and as a result had observed a mere temporary deviation.
Yet the deviation was real, and the behavior reported was typical for
the subject during that time.

Careful attention must be paid to definition in attempting to observe typical
behaviors. One can be sure what a report of typical behavior means only
when the observer specifies the range of time represented in the data, the
range of situations, and the range of motivations. When these are not specified
formally, they are often implied in the description of the observation
method.

The second procedural requirement is that the act of observing must not
alter the behavior observed. Just as the presence of a traffic cop at an intersection
raises drivers from their habitual level to their best ability, so the
presence of the observer may cause the subject to try harder. This seems to
occur even when no reward or punishment will result. Roethlisberger and
Dickson attempted to compare work output under various conditions at a
Western Electric manufacturing plant. Relay assemblers were placed in a
small experimental room where they could be observed and their output recorded
in great detail. Various experimental rest pauses and privileges were
introduced; as each change was introduced, no matter what it was, production
climbed. Finally, in the twelfth and thirteenth periods, the rest pauses
and privileges were removed, and production per hour still remained as high
as under the “best” working conditions. Another striking change was that absenteeism
dropped from 15.2 days per year per worker before entering the
study to 3.5 days per year in the test room. The heightened morale of the
workers, as a result of being singled out for study, of being better acquainted
with their supervisors, and of feeling personal responsibility for
their rate of output, changed their performance so that it was no longer
comparable to that in the regular workroom.

Distortion is less when the judging is a regular part of the work procedure.
Ratings by foremen can reasonably be regarded as reports of typical behavior
in the plant, for the foreman is usually present; how the man would act if
no foreman were provided is not of interest. Wherever the rater or observer
is a regular member of the group, his presence will have little distorting effect.

An ideally random sample of the subject’s total behavior can never be observed.
Those moments of his life which are open to the psychologist’s inspection
are by no means typical. It is a fantasy to think of assessing generosity
by tabulating the businessman’s responses to appeals; these private
moments are not open to observation. Observation in representative situations
can be used only to learn about the individual’s typical public behavior:
in classrooms, on playgrounds, and in certain work situations. Indeed, direct
observation of samples of “natural” behavior is restricted almost entirely to
research, particularly research on children who are young enough to ignore
the presence of the observer.

1. Define the range, in time and situations, of the behavior which should be studied
to answer these questions.

a. How well does this supervisor handle grievances?

b. Does study of philosophy make an adult more rational in his daily life?

c. Does viewing a film on nutrition improve housewives’ practices in menu planning?

d. Do graduates of the modern elementary school write legibly?

e. How anxious is this patient at this point in therapy?

Reports from Others and from the Subject 

If we are willing to sacrifice the precision and detachment of the scientific
observer, we can obtain useful information from the subject’s acquaintances
and coworkers. The rating by a foreman is more nearly a general impression
than a dependable record of typical behavior, but it is nonetheless useful.
Similarly, mothers give information about children, nurses about patients,
and so on.

It is common to regard such ratings and descriptions as information given
by a competent authority who is in a sense the professional ally of the psychologist.
Any one such report, however, must be regarded as one individual’s
perception of another, subject to as much distortion as any perception
of a fluctuating, ambiguous stimulus. Indeed, investigators are now beginning
to use such reports as information about the personality of the rater
rather than solely as information about the person observed.

Sociometric or peer-rating techniques obtain reports on the individual from
his fellows. Children rate other members of their school groups, college
girls rate other members of their sororities, and officer candidates rate classmates.
The interpreter may assume that such a report is a valid summary of
typical behavior or, refusing to make this assumption, may still be interested
in the report as evidence of the impression the subject makes on others.

Similar comments may be made about the self-report. The subject is indeed
an authority on his own behavior, but there are distortions in his perception
of himself and in his report. We shall discuss the interpretation of
self-reports at some length below.

Performance Tests 

Obtaining reports from others, or self-reports, avoids some of the difficulties
of field observation. Such reports can (in principle) shed light on corners
of the subject’s life where the observer may never go, can cover past
behavior which is no longer observable, and can take into account far more
incidents than any observer could record. These benefits are offset, however,
by the distortions which result when subjective impressions are substituted
for precise quantitative records. Performance tests seek to obtain precise and
dependable information, but they do so by giving up the attempt to take a
representative sample. Just as in a measure of aptitude or achievement, the
tester places the individual in a standardized situation to which he must respond.
His performance is evaluated either by an objective performance
score or by observation of the way he responds.
A great variety of techniques fall into this category. One might measure
interests, for example, by a current events test covering developments in science,
engineering, music, public affairs, and so on. Knowledge in the various
fields is to some degree a reflection of relative interest. One might allow the
subject a supposed “rest period” during a battery of other tests and let him
browse in a library; the books which attract his attention might be presumed
to represent his interests. A third approach is to require him to make up stories
about pictures showing people at work in settings such as a hospital operating
room. The ideas and feelings he attributes to the characters in the
pictures may indicate his attitudes about various types of work.
There is no accepted classification system for performance tests. Cattell
has proposed that the term objective test be applied to devices like the current
events test of interests which yield a direct measure of performance unmodified
by any observation or interpretation. This name, however, is not
accepted by other workers. The name situation test has also been applied to
tests of performance in complex, lifelike situations. It was first used for work
samples of leadership. The candidate for a leadership position was placed in
a standardized situation, given a crew of men, and observed as he directed
them. Another subcategory is the projective technique. A projective technique
gives the subject material with which to work creatively; e.g., the
tester presents an ambiguous stimulus (inkblot, picture, unfinished story,
etc.) and asks the subject what he sees in it or what he thinks will happen
next. These interpretations are regarded as projections of the subject’s unconscious
wishes, attitudes, and conceptions of the world.
The great advantage of the performance test is that it permits fair comparison
of individuals. A rating of leadership may reflect differences in opportunity
rather than differences in readiness to lead, but a performance test
gives each individual in turn the same opportunity to lead. Individual differences
in use of that opportunity reflect personality. Behavior in this standardized
situation may be far from “typical.” At best, we obtain a sample of
response to a very special stimulus, namely, a-leadership-opportunity-when-
being-tested-by-a-psychologist-whose-good-opinion-will-have-certain-consequences.
The performance test gives neat data, but interpretation is far more
difficult than interpretation of observations in representative situations or reports
from others.

2. How well can the four procedures (observation in representative situations,
report from others, self-report, and performance tests) satisfy the following
requirements? Rank them from best to poorest in each respect.

a. The data reflect differences in personality rather than differences in environment
and opportunity.

b. The data reflect the individual's behavior, undistorted by perception of those
who provide data.

c. The data provide a summary or estimate of the individual's behavior during
all moments of his life.

d. The results are the same, whether or not the subject wishes to make a good
impression on the psychologist.

THE SELF-DESCRIPTION AS A REPORT OF 
TYPICAL BEHAVIOR 

The simplest view of the self-report is to treat it as a record of typical behavior,
which the subject is in a uniquely excellent position to observe. There is
some justification for so interpreting the interest inventory, since the person
seeking guidance wants his interests to be satisfied in the work he selects.
Even the interest inventory, however, is not proof against distortion as a result
of status aspirations. In other questionnaires, there are many sources of
distortion which prevent accepting the score as a true summary of behavior.

The first difficulty in questionnaire interpretation is that items are somewhat
ambiguous. “Do you make friends easily?” seems a straightforward
question, but it is hard to say just what behavior the question refers to, and
what the tester means by easily. The subject, reflecting upon his past behavior,
is unable to count up particular incidents. If he could, his report would
be a simple factual statement. But he will recall some cases where he formed
a friendship quickly and other cases where an acquaintance remained somewhat
distant over many months. If he tries to push for a more literal interpretation
of the question, he soon bogs down. What does friend mean: close
and intimate companionship, pleasant interaction without emotional involvement,
or something in between? The subject taking a questionnaire does not
ask such fussy questions (though a scientific observer tabulating typical behavior
would have to). The subject answers the question in terms of a general
feeling or self-concept. If he regards himself as being the type who
makes friends easily (hang niceties of definition!), he says “yes” to the question.
Another equally popular boy may have a different self-concept and respond
“no.”

Similar difficulty arises because most questionnaires ask about responses to
the hypothetical “typical” situation, instead of asking about response in well-defined
situations. “Do you seek suggestions from others?” is a fairly clear
question, but most people would have to answer, “Sometimes I do, but not
always.” This might be further qualified: “I do on difficult problems”; “I do
if someone is around whose ideas are especially good”; “I don’t if I’m supposed
to make the decision myself.” These qualifications would have to be
stated if the subject tried seriously to report typical behavior. Since he cannot
average his memories to determine what percentage of the time he has
sought suggestions, the question will be answered offhand. When one person
defines “yes” to mean “with very few exceptions” and another defines it as
“fairly often; at least in difficult situations,” they are answering different
questions and their responses are not comparable. Another example is the
apparently clear item: “Do you like to operate an adding machine?” Many
students say that they enjoy this but would be dissatisfied with a job where
they had nothing to do but operate an adding machine. It is impossible to
qualify items to eliminate such problems of interpretation.
Many self-report tests provide a response scale using such words as “always,”
“frequently,” “seldom,” and “never.” Simpson (1944) examined how
such ratings might compare with quantitative observations. He asked students
what percentage frequency of a particular response would correspond
to a report that this was what they “usually” did. Twenty-five percent of them
applied “usually” only to events occurring at least 90 percent of the time; another
25 percent said that “usually” meant a frequency below 70 percent.
The quantitative interpretation of other words is shown in Table 58. It is
evident that two subjects with identical behavior may choose entirely different
adverbs to describe what they do.

TABLE 58. Range of Meanings Assigned to Words
Commonly Used in Personality Inventories

What Percentage of All Occasions Is Indicated by the Word at Left?

                   Median      Range of Answers of Middle
Word               Answer      50 Percent of Subjects

Usually              85               70-90
Often                78               65-85
Frequently           73               40-80
Sometimes            20               13-35
Occasionally         20               10-33
Seldom               10                6-18
Rarely                5                3-10

Source: Simpson, 1944.

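
The spread in Table 58 can be made concrete with a short sketch. The Python below is only an illustration: the names `middle_50` and `overlaps` are hypothetical, and the numbers are the middle-50-percent ranges reported in Table 58. It lists the pairs of words whose ranges overlap, that is, pairs which two subjects might reasonably apply to the same frequency of behavior.

```python
from itertools import combinations

# Middle-50-percent ranges (percentage of occasions) copied from Table 58.
# The names `middle_50` and `overlaps` are illustrative, not from the source.
middle_50 = {
    "usually": (70, 90),
    "often": (65, 85),
    "frequently": (40, 80),
    "sometimes": (13, 35),
    "occasionally": (10, 33),
    "seldom": (6, 18),
    "rarely": (3, 10),
}


def overlaps(a, b):
    """Return True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Every pair of words whose reported ranges have some frequency in common.
overlapping_pairs = [
    (w1, w2)
    for w1, w2 in combinations(middle_50, 2)
    if overlaps(middle_50[w1], middle_50[w2])
]

for w1, w2 in overlapping_pairs:
    print(f"{w1!r} and {w2!r} can describe the same frequency")
```

Even with only the middle half of the subjects considered, most neighboring words overlap, which is the difficulty the paragraph above describes.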
Response Styles

The use of fixed response categories such as “Yes,” “Agree,” and “Like”
makes questionnaires particularly subject to individual response biases. This
was first noted by Lorge (1937; see Cronbach, 1946, 1950), who counted
how often people responded “Like” to SVIB items. One subject uses the
word for every activity he would not positively dislike; another applies the
word only to activities to which he is strongly attached. Such differences can
lead to quite different Strong profiles.

Where response style has considerable effect, it becomes difficult or impossible
to interpret self-reports as if their face content were true (Jackson and
Messick, 1958). The California F scale is a questionnaire developed for research
on authoritarian personalities (Adorno et al., 1950). The items consist
of strongly worded opinions, most of which express a critical attitude
about human nature. People who endorse these items tend to show other
symptoms of readiness to follow strong, repressive political leadership, and
hence the scale is labeled F, for “fascist.” It was supposed, for a number of
years following the publication of this scale, that it gave a dependable report
of the subject’s attitudes. Later, it was noted that virtually all the items in
the scale were worded in hostile language, so that the response “Yes” was
consistently scored as undesirable. Several investigators questioned whether
the score reflected the content of attitudes or an acquiescent response style.
To investigate this, a “reflected” scale was constructed. For each item an alternate
version was written which had the opposite ostensible meaning. We
may label this scale F'. Then a pair of items might be:

(F) Obedience and respect for authority are the most important virtues children
should learn.

(F') Self-reliance and lack of need to submit to authority are the most important
virtues children should learn.

In one study the number of statements marked in the authoritarian manner
on the two scales (“Yes” on F, “No” on F') correlated only .20. The correlation
would be .50 or beyond if responses were determined primarily by the
content of attitudes (Bass, 1955; Messick and Jackson, 1957; Ancona, 1954;
Chapman and Campbell, 1957).
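
The comparison of the two scales can be sketched in a few lines. The Python below is a minimal illustration, not the procedure of the studies cited: the answers are invented, and the function names `authoritarian_count` and `pearson_r` are hypothetical. Each subject is scored in the authoritarian direction on both scales (“Yes” on F, “No” on F'), and the two counts are then correlated across subjects.

```python
from math import sqrt


def authoritarian_count(answers, keyed):
    """Count the answers given in the authoritarian direction (`keyed`)."""
    return sum(1 for answer in answers if answer == keyed)


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: four subjects, four F items and four reflected F' items each.
subjects = {
    "acquiescent": (["Yes"] * 4, ["Yes"] * 4),            # agrees with everything
    "consistent authoritarian": (["Yes"] * 4, ["No"] * 4),
    "consistent non-authoritarian": (["No"] * 4, ["Yes"] * 4),
    "nay-sayer": (["No"] * 4, ["No"] * 4),                # rejects everything
}

f_scores = [authoritarian_count(f, "Yes") for f, _ in subjects.values()]
f_prime_scores = [authoritarian_count(fp, "No") for _, fp in subjects.values()]

r = pearson_r(f_scores, f_prime_scores)
print(f_scores, f_prime_scores, r)
```

In this invented sample the two subjects who answer by style cancel the agreement of the two who answer by content, so the correlation between the scales is zero. A sample answering purely by content would give a high positive correlation.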

Faking 

The tester would like to view his inquiry as a scientific project to which
the subject is willing to contribute valid information. The subject comes to
the test with a quite different purpose. In a clinical test, he may want to
avoid certain threatening diagnoses. In employment testing, his first concern
is to land the job. In vocational guidance, he may be more concerned
with convincing the tester that he should enter a certain occupation than
with learning the truth about his suitability for it. When industrial workers
filled out identical health questionnaires under two conditions, the results
were strikingly different. One questionnaire was turned in to the company
medical department, as a preliminary to a medical examination designed to
improve the workers’ health. The other questionnaire was mailed directly to
a research team at a university. The workers listed far more symptoms on the
research questionnaire (which would not help them) than on the other, even
though an honest report to the company physician might bring them medical
help (Streib, unpublished).

As might be expected, the subject most often presents himself in a favorable
light (Edwards, 1957). On an item (e.g., Do you make friends easily?)
where one response is socially desirable, the great majority of subjects will
give that desirable answer. The subject’s tendency to make favorable statements
about himself, i.e., to put up a good front, is often referred to as a “façade”
effect. Striving to make a favorable impression can be identified by
counting how often favorable self-descriptions are checked. A high façade
score may occur, of course, because the person is truly superior in behavior
and adjustment, but persons whose behavior approaches the ideal on many
dimensions are so unusual that faking is suspected.

Not all the subjects “fake good.” Some deliberately give an unfavorable
picture of themselves. A draftee who believes that a poor score on a personality
questionnaire will get him a discharge may report an astonishing array
of emotional symptoms. In an ordinary clinical test, exaggerating symptoms
may be a gambit to enlist sympathy and attention. The subject may prefer
to have the tester believe that troubles as a student are due to emotional disturbance
than to be thought stupid or lazy.

A type of distortion which confounds evaluation of psychotherapy is the
so-called “hello-goodby” effect. Upon entering the clinic, the client tends to
present the worst conceivable picture of himself. He may not lie outright,
but on borderline responses he selects unfavorable alternatives. This may be
a calculated strategy to get the clinic to take his problems seriously and make
therapy available to him, or a sign of high awareness of symptoms.

Just the opposite effect is often noted when the client is discharged after
treatment. Now the self-description glows with the psychological counterpart
of “Thanks, Doc. I feel fine.” This may involve self-deception, to prove
that the sacrifice of time, money, and privacy was not foolish. One important
motivation of faking good, Hathaway suggests, is the client’s desire to repay
the therapist by letting him see how much help he has given. It would be ungrateful
indeed for the client to dwell on the symptoms the therapy had left
untouched. On his exit questionnaire the client may be disposed to give himself
(and his therapist) the benefit of all the borderline decisions. The number
of symptoms is thus below the number reported at intake. The therapy
may have produced genuine improvement, but true improvement is hard to
distinguish from change in test-taking attitude.

Investigations of faking compare scores made under instructions to describe
oneself honestly with scores made when directed to try for a good
score or a bad score. All these studies demonstrate that faking is possible; we
need cite only two representative findings. Longstaff gave the Strong and
Kuder tests to students with the usual instructions, and then asked them to
try simultaneously to fake a particular pattern, high on certain keys and low
on others. The results, some of which are given in Table 59, indicate that
both tests are fakable. On the whole, it is easier to fake high interests on the
Strong and easier to fake aversion on the Kuder. The several keys are not
equally fakable. Wesman (1952) gave the Bernreuter Personality Inventory
with the following instructions: “I want you to pretend that you are applying
for the position of salesman in a large industrial organization. You have been
unemployed for some time, have a family to support, and want very much to
land this position. You are being given this test by the employment manager.
Please mark the answers you would give.” The next week, the same inventory
was filled out “as if you were applying for the position of librarian in a
small town.” The scores on the two occasions differed spectacularly, as Figure
77 shows. Studies such as these prove beyond dispute that personality
tests can be falsified, no matter how constructed. Probably most applicants
give more honest answers than did the students in these experiments, but
the fact remains that the dishonest applicant can probably beat the test.

TABLE 59. Percentage of Male Students Able to Fake Strong and Kuder
Scores Successfully

Scores "Faked Upward"

                        Percentage                      Percentage reaching
Strong key              reaching A     Kuder key        75th percentile       Difference

Carpenter                    9         Mechanical             32                 -23
Chemist                     91         Scientific              5                  86
Artist                      86         Artistic               83                   3
Author                      83         Literary               51                  32

Scores "Faked Downward"

                        Percentage                      Percentage reaching
Strong key              reaching C     Kuder key        25th percentile       Difference

Accountant                  26         Computational          30                  -4
Life insurance sales        20         Persuasive             70                 -50
Personnel manager           37         Social service         70                 -33
Office man                  54         Clerical               41                  13

Source: Longstaff, 1948.
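
The “Difference” entries of Table 59 are the Strong percentage minus the Kuder percentage for each pair of keys. The sketch below recomputes them from the table's figures; the variable and function names are hypothetical, and the pairing of Strong and Kuder keys is assumed to follow the column order of the table.

```python
# Percentages copied from Table 59 (Longstaff, 1948).
# Each row: (Strong key, Strong percentage, Kuder key, Kuder percentage).
faked_upward = [
    ("Carpenter", 9, "Mechanical", 32),
    ("Chemist", 91, "Scientific", 5),
    ("Artist", 86, "Artistic", 83),
    ("Author", 83, "Literary", 51),
]
faked_downward = [
    ("Accountant", 26, "Computational", 30),
    ("Life insurance sales", 20, "Persuasive", 70),
    ("Personnel manager", 37, "Social service", 70),
    ("Office man", 54, "Clerical", 41),
]


def differences(rows):
    """Strong percentage minus Kuder percentage for each pair of keys."""
    return [strong_pct - kuder_pct for _, strong_pct, _, kuder_pct in rows]


up_diff = differences(faked_upward)
down_diff = differences(faked_downward)

# A positive difference means more students faked the Strong key successfully.
print("faked upward:", up_diff)
print("faked downward:", down_diff)
```

Three of the four upward differences are positive and three of the four downward differences are negative, which matches the statement that high interests are easier to fake on the Strong and aversion is easier to fake on the Kuder.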




FIG. 77. "Self-confidence" scores of the same students when playing the role of applicant for
sales and library positions (Wesman, 1952).

3. Some Strong items are obviously related to certain occupations, and these items
are usually given higher weights than the subtler items which have a less direct 
relationship Garry (1953) finds that when subjects attempt to fake interest in a 
particular field they answer the "obvious" items correctly. Most of them are un¬ 
able to fake successfully on the items whose correlation with the criterion is 
lower and less obvious. Would it be a good idea to base Strong scores entirely 
on the latter items in order to thwart fakers? 

OVERCOMING DISTORTION IN SELF-REPORT 

Establishment of a Cooperative Relationship 

In any interview or personality test, the psychologist appeals for cooperation
and employs his skill as best he can to produce rapport. But rapport is a
complex interpersonal relationship, depending on many factors other than
the tester’s technique. Never may the tester safely assume that he has established
the ideal relationship which will cause the subject to want to tell “the
whole truth.” The tester’s choice of questions and his subtle modification of
the testing situation may cause the subject to shift back and forth from concealing
to confessing, but there is little chance that he will come to rest at
“objectivity.” Moreover, as psychoanalytic interviews make clear, the subject’s
memory of his own life is so distorted by his emotional conflicts that
long therapeutic sessions are required before he can bring to consciousness
some of the important facts about himself.

Constraint upon Responses 

Variation in response style is reduced or eliminated by forcing all persons
to respond to the same issue. The most common technique is the “forced-choice”
item seen in the Kuder inventory. Instead of using the somewhat ambiguous
categories L, I, and D, Kuder asks which of three activities the subject
likes best. Now the subject cannot say he likes everything, or withhold
information by checking “Indifferent.” The forced choice demands information
regarding specific attitudes, traits, and interests.

Forced choice is especially useful as a means of reducing façade effects.
The popularity (“social desirability”) of each statement can be determined
in a preliminary study. The test constructor then forms sets of statements
having equal desirability. This increases the amount of information obtained,
as can be seen from the following example. Three interest items might be:

                              Percent of Subjects Saying “Like”

Watching Western movies                      90
Driving in the country                       90
Bird-watching                                10


Administered separately, these would be inefficient for detecting individual
differences, since 90 percent of the subjects give the same answer. If movies
were paired with bird-watching in a preference item, results would be little
better; at least 80 percent of the subjects would prefer movies. If the movie
item is paired with the equally popular driving item, the forced choice will
divide the subjects into nearly equal groups, thus obtaining a maximum
amount of information about differences in interest.
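
One way to make “amount of information” concrete is the binary entropy of an item's split. This measure is borrowed from information theory and is an assumption of this sketch, not something the text itself uses; the function names are hypothetical. The sketch also shows where the “at least 80 percent” figure comes from: only subjects who do not like movies (at most 10 percent) or who do like bird-watching (at most 10 percent) can choose bird-watching.

```python
from math import log2


def binary_entropy(p):
    """Information, in bits, carried by a two-way split with proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def min_preferring_popular(p_like_popular, p_like_unpopular):
    """Lower bound on the share choosing the popular item in a forced pair.

    Only subjects who do not like the popular item, or who do like the
    unpopular one, can possibly choose the unpopular item.
    """
    return max(0.0, 1.0 - ((1.0 - p_like_popular) + p_like_unpopular))


# Items given separately: 90 percent say "Like", 10 percent do not.
separate_item = binary_entropy(0.90)

# Movies paired with bird-watching: at least 80 percent choose movies.
bound = min_preferring_popular(0.90, 0.10)
lopsided_pair = binary_entropy(bound)

# Movies paired with the equally popular driving item: roughly a 50/50 split.
balanced_pair = binary_entropy(0.50)

print(round(separate_item, 2), round(bound, 2),
      round(lopsided_pair, 2), round(balanced_pair, 2))
```

On this measure the evenly split pair carries a full bit per response, while the separate items and the lopsided pair carry less, in line with the argument above.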
The subject who wants to “fake good” is outwitted by the forced choice
between equally desirable traits. He is required to describe which good
statements are most characteristic of him, and which faults he suffers to the
greatest degree. Navran and Stauffacher (1954) asked student nurses to
rank fifteen needs (e.g., deference) in order of their social desirability. Each
nurse also ranked the needs from most to least characteristic of herself. The
correlation between these two rankings for any nurse would indicate her
tendency to give a favorable picture of her own needs; this correlation was
higher than .44 for the majority of nurses. A quite different result was found
on the Edwards forced-choice inventory. In not one case did the rank order
of Edwards scores correlate as high as .44 with the desirability ranking. Another
study employed a direct measure of façade or self-favorableness. This
score correlates .50 to .90 with Taylor’s anxiety, Guilford’s cooperation, and
similar scales which score Yes-No answers. The highest correlation among
the fifteen scales of the Edwards forced-choice instrument is .32 for the Endurance
score (Edwards, 1957; R. E. Silverman, 1957). Although the Edwards
scale forces the subject to give a profile with some low points, it is by no
means proof against faking. A subject who can guess that certain good qualities
are more important than others in decisions the tester will make can distort
his responses to earn high scores in those qualities and low scores in qualities
which seem unimportant on this occasion.
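
The correlation between a desirability ranking and a self-ranking can be computed with a rank-order coefficient. The sketch below uses Spearman's formula for untied ranks, which is assumed here; the source does not say which coefficient Navran and Stauffacher used. The rankings are invented and cover five needs rather than fifteen to keep the example short, and the function name `spearman_rho` is hypothetical.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, no ties."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented rankings of five needs by one subject.
# 1 = most socially desirable / most characteristic of herself.
desirability = [1, 2, 3, 4, 5]
self_description = [2, 1, 3, 5, 4]

rho = spearman_rho(desirability, self_description)
print(round(rho, 2))
```

A coefficient near 1 for a subject means her self-description follows the desirability order closely, which is the tendency the study treated as a sign of giving a favorable picture.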

The forced-choice method has its defects. It requires more time to obtain
an equal number of responses. It is sometimes resisted by subjects who object
to its “Have you stopped beating your wife?” character. And it may reduce
the validity with which the test predicts external criteria, for reasons
discussed below.

4. What sort of personality would you try to show if "faking good"

a. in a test to select Boy Scout leaders?

b. in a test to select scientists for advanced training?

c. in a test for psychiatric ward attendants?

Empirical Validity of Forced-Choice Instruments. In some inventories, eliminating
response sets eliminates the significant, criterion-related information
from the scores. As mentioned earlier, the F scale measuring fascist tendencies
seems to be largely influenced by acquiescence. This seemingly irrelevant
influence may be desirable. Tendency to accept extreme statements
may in itself be a symptom of an authoritarian outlook; if so, a forced-choice
scale which ruled out differences in acquiescence would eliminate important
evidence (Christie et al., 1958; Gage et al., 1957).

Our present knowledge about acquiescence, façade, and other response
styles may be summarized as follows (Cronbach, 1950):

• General response styles obscure descriptive information. A person who
says that he likes nearly everything tells us little about his particular interest
patterns.

• Response sets can be modified by changing the directions. They are
therefore to some extent transient, and irrelevant to the intended measurement.

• Response styles often correlate with practical criteria, but the correlations
are not high.





• The response style involves three types of variation: transient and unreliable
attitudes, stable patterns which reflect criterion-relevant aspects of
the personality, and stable but psychologically unimportant verbal habits.
The forced-choice technique eliminates all three types of variation so that
scores depend only on reactions to the content of items.

o Reliability is decreased by shifting to the foiced-choice form, because 
choice is made difficult, The foiced-choice instrument is oidinanly a purer 
measure of the criterion-ielevant qualities in the test, because irrelevant ver¬ 
bal response habits are eliminated. Changing to the forced-choice form may 
or may not laise piedictive validity because the loss in reliability may offset 
the gain in lelevance 

The foregoing statements seem to vacillate between regarding response
styles as beneficial and regarding them as an interference. The statements
are reconciled when we recognize that the effect of response styles depends
on test length. According to the argument developed on page 130, the effect
of length depends on the purity of a test. A short impure test capitalizes
upon the limited predictive validity of response styles and ordinarily gives
a higher correlation with a criterion than does a short forced-choice test.
When the number of items is very large, the purer forced-choice test is more
valid (Osburn, Lubin, Loeffler, and Tye, 1954).
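
The interaction of length and purity can be illustrated numerically. The
sketch below is not taken from the studies cited; it assumes a simple model
in which every standardized item is a weighted sum of a content trait, an
uncorrelated response-style trait, and random error, and in which the
forced-choice item has no style component but a weaker content loading
(because choice is made difficult). The function name, the loadings, and the
trait validities of .60 and .20 are all hypothetical.

```python
import math


def validity(n_items, content_loading, style_loading, rho_content, rho_style):
    """Correlation of an n-item total score with a unit-variance criterion.

    Assumed model (illustrative only): each item has variance 1 and equals
    content_loading * C + style_loading * S + error, where C and S are
    uncorrelated unit-variance traits whose correlations with the criterion
    are rho_content and rho_style.
    """
    error_var = 1.0 - content_loading ** 2 - style_loading ** 2
    # Trait parts accumulate in proportion to n; error only as sqrt(n).
    covariance = n_items * (content_loading * rho_content
                            + style_loading * rho_style)
    variance = (n_items ** 2 * (content_loading ** 2 + style_loading ** 2)
                + n_items * error_var)
    return covariance / math.sqrt(variance)


if __name__ == "__main__":
    RHO_CONTENT, RHO_STYLE = 0.60, 0.20   # hypothetical trait validities
    for n in (10, 50, 200, 2000):
        impure = validity(n, 0.3, 0.3, RHO_CONTENT, RHO_STYLE)
        forced = validity(n, 0.2, 0.0, RHO_CONTENT, RHO_STYLE)
        print(f"{n:5d} items  impure {impure:.3f}  forced-choice {forced:.3f}")
```

With these assumed values the impure form is the more valid at 10 and 50
items and the forced-choice form at 200 and 2000 items. The crossover point
depends entirely on the loadings chosen and carries no empirical weight.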

When the subject is motivated to give a favorable report on himself, even
a short forced-choice questionnaire is likely to be advantageous. On an
opinion or interest questionnaire, faking is not generally a serious problem.
Since “every man has a right to his own opinion,” no one set of answers is
usually considered especially desirable. Unless the subject has a special
reason for concealing his views, he will answer frankly. “The respondent’s
understanding of the purpose of the test and the psychologist’s understanding
are in agreement. Were the respondent to read the psychologist’s report of
the test results, none of the topics would surprise him” (Campbell, 1957).

Concealing the Purpose of the Test 


Some tests of personality openly refer to themselves as measures of
adjustment. More commonly, the title is less informative: for example, “The
California Personality Inventory.” The subject does not know what scores will
be recorded and what interpretations will be made. He may guess something
from the content of the items, but he is unlikely to suspect that
interpretations will be made about his tendency to delinquency, among other
things. It is harder for the subject to fake when he does not know what the
tester is looking for, though in that situation he may become even more
suspicious and defensive in his responses.

An effective method of concealment is to state a plausible purpose which
is not the tester’s real center of interest in giving the test. The F scale,
for instance, is on the surface an inventory of opinions, but it is used to
draw conclusions about the underlying personality. If the content of a test
is such that subjects regard it as a measure of some ability, faking can be
reduced or even eliminated. Campbell (1957) discusses use of measures of
knowledge or reasoning ability as disguised measures of attitude. Another
type of disguise uses questions having one ostensible content but employs a
scoring method which has little or nothing to do with that content. One
investigator asked boys to check which books they had read, seemingly to
measure reading interests. Actually, he had inserted fictitious titles in the
list, and the number of such titles checked was taken as one indicator of
deceit or boasting.

While disguising one’s purpose may be effective, it skirts the edge of
unethical practice. And, as one writer has commented, to try to prevent
deceptive subject behavior by becoming deceptive oneself merely encourages
the view that psychologists are tricky, and in the long run may drive
subjects to even greater degrees of evasiveness.


Verification and Correction Keys 

The Kuder interest inventory has a special verification score, obtained by
counting the subject’s responses to certain items which are rarely chosen. A
subject who made a large number of these rare responses probably answered
the items without proper concentration. This by no means detects all types
of distortion, but it is of value in group testing. Some subjects are too
little motivated to make the many preference judgments seriously, and there
are even some who lose interest and simply mark at random from that point to
the end of the test.

The Edwards inventory uses 210 forced-choice pairs of statements. Fifteen
pairs are presented a second time at random intervals within the test. A
verification score indicates whether the student gave the same answer on the
two occasions when he made the same choice. Some inconsistencies are to
be expected, but numerous reversals suggest either careless response,
resistance, or a seriously confused self-picture. Other inventories use a
variety of check scores, including facade or social desirability keys and
keys to detect response styles (e.g., a count of evasive “Cannot say”
responses in the Minnesota Multiphasic).

The check score may be used most simply to eliminate suspect records. It
is also possible to apply statistical corrections which estimate the score
that would have been obtained with a normal response style.
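
Both kinds of check score reduce to simple counts. The following is a
minimal sketch, not the publishers' scoring procedure: the item numbers, the
answer codes, and the rare-response key are hypothetical, and no cutoff for
a “suspect” record is implied.

```python
def rare_response_count(responses, rare_key):
    """Kuder-style check: count answers matching rarely chosen options.

    responses maps item number -> chosen option;
    rare_key maps item number -> the option that is rarely chosen.
    """
    return sum(1 for item, answer in responses.items()
               if rare_key.get(item) == answer)


def consistency_count(responses, repeated_pairs):
    """Edwards-style check: count repeated pairs answered identically.

    repeated_pairs lists (first_position, second_position) for each pair
    of statements that appears twice in the booklet.
    """
    return sum(1 for first, second in repeated_pairs
               if responses[first] == responses[second])


if __name__ == "__main__":
    # Hypothetical six-item record; positions 4 and 6 repeat items 1 and 3.
    record = {1: "A", 2: "B", 3: "A", 4: "B", 5: "A", 6: "A"}
    print(rare_response_count(record, {2: "B", 5: "B"}))   # one rare answer
    print(consistency_count(record, [(1, 4), (3, 6)]))     # one consistent pair
```

A record with many rare responses, or few consistent pairs, would be set
aside or examined further, as the text describes.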

5. If a high-school senior earns a suspiciously high verification score on the Kuder,
what should the counselor do?

ALTERNATIVE INTERPRETATIONS OF RESPONSE CONTENT

No matter what special procedures are used to reduce distortion, the test
responses depend upon how much of the truth the subject is willing and able
to report. Interpretation must take this fact into account.

Interpretation as True Self-Description 

Simplest but most hazardous is to interpret the responses as a frank report
of the subject’s typical behavior. If the relationship between tester and
subject is such that this is a reasonable expectation, then no subtleties of
test design are required.

Complete frankness cannot be anticipated in any situation where the
subject will be rewarded or punished for his response. Some degree of reward
and punishment is implicit in any institutional use of tests, such as
clinical diagnosis or employee selection. Honest self-examination can be
hoped for only when the tester is helping the subject to solve his own
problems, and even then the subject may have a goal for which he wishes the
support of the counselor’s authority, which biases his response.

Interpretation as “Published” Self-Concept 

It is more reasonable to interpret the report as a statement of the subject’s
public self-concept than as a statement of his typical behavior or of his
private self-concept. To be sure, his public self-concept should correspond
in some measure to his behavior, but the ambiguity of test items and the
inevitable distortion in self-observation reduce this correspondence.

A historian, examining a diary written by a long-dead statesman, refuses
to assume that the statements made therein are true reports of the man’s
beliefs and feelings. Unless there is considerable evidence that the document
was a private one never intended for the light of day, the safest assumption
is that the statements represent the image the man wished to leave in
history. The psychologist likewise can regard the responses of his subject
as a “published” self-concept, a statement of the reputation the subject
would like to have.

Sometimes this information may be of considerable value. The fact that
an individual is unable to admit certain kinds of tabooed impulses may be
highly diagnostic. A person who presents too perfect a picture of himself
may be expressing his fear that others are critical and punitive, and that he
can maintain their respect only by keeping his halo bright. Unless there is
some obvious motive for deceitful response, the psychologist should suspect
that the person who presents so perfect a facade on the test maintains a
similar facade in his social relations. The facade of perfect control and
freedom from impulse is a brittle one and can be maintained only at
considerable emotional cost. Hence the facade itself has diagnostic and
prognostic significance.

The person who admits to certain emotional problems may also be building
up a public image. These may not be the most important problems of
which he is conscious. It is commonly observed in psychotherapy that people
do not bring out their main problems until several interviews have passed.
When a person admits to problems which call for counseling, his report is an
invitation to open counseling with an examination of the area mentioned.
He is saying, first of all, that he is willing to be counseled; second, that
this area is one which concerns him but is not too sensitive to be discussed.
His most serious conflicts may be completely concealed by his questionnaire
responses, but if he is unwilling to admit these conflicts he is probably
also unwilling to deal with them immediately in psychotherapy.

6. A questionnaire is filled out by all parents belonging to a study group, as a
means of identifying problems to be taken up in group discussion. Mr. Smith
checks many problems having to do with developing the child’s honesty, respect
for the property of others, and care for his own property. The school counselor
knows, however, that his son has been in difficulty several times because of
aggressive fighting on the playground, window breaking, and other aggressive
offenses which have been called to Mr. Smith’s attention. Can the counselor
draw any useful conclusion from Mr. Smith’s self-report?

7. An attitude test for foremen presents hypothetical problems that might arise on
the job and asks the subject to indicate what action he would take if he were
foreman. Scores are based on response patterns (e.g., “takes quick action,”
“seeks facts,” “emphasizes morale,” “emphasizes cost-cutting”). What use can
be made of the responses, in view of the obvious temptation to give a desirable
picture?

Dynamic Interpretation 

The clinical psychologist is unwilling to reduce personality to a statistical
report of overt behavior. The clinician is concerned not with the number of
times a person becomes emotionally upset but with the conditions under
which this happens and the forces, internal and external, that lead to it. An
individual who now becomes upset once per month might become chronically
disturbed if conditions changed in a certain manner. In this event, a
statistical average of his past behavior would have almost no predictive
value. A “dynamic” picture of an individual is a picture of the forces
changing his response as situations change. Important in such a picture are
his perceptions of the people he deals with, his feelings about himself, and
the needs which he is trying to satisfy. If the clinician has insight into
these hidden characteristics, he then has some hope of predicting reaction to
particular opportunities or stresses.

Drawing conclusions about personality dynamics even from an extended
series of interviews is difficult, and a brief test or series of tests can
only offer hypotheses of questionable validity. Tests are commonly used as a
basis for dynamic interpretation, however, and highly experienced and
insightful interpreters can use them profitably. The basic assumption of any
dynamic interpretation is that every act of behavior is meaningful, even when
it is inconsistent with other observations. The task of the interpreter is to
seek some underlying unity which resolves the contradiction.

Dynamic interpretation requires extensive data about the individual’s
environment and difficulties as well as his test score, and considerable
knowledge of personality theory. It can be demonstrated in a brief example,
using an analysis by a University of California counselor. Barbara Kirk
(1952) describes the pattern often found among academic failures or
near-failures who do well on aptitude tests given at the Counseling Center.
(Our summary is drastically condensed.)

The explanation and the excuses for the academic deficiency are unrealistic,
superficial, and largely implausible. The counselee demonstrates no real
recognition or admission of the reasons for this deficiency, but, on the other
hand, he evidences no surprise at the results of the tests. He may be surprised
that he was not tense or bothered on tests administered to him during
counseling because he frequently has been tense or bothered during academic
examinations. The impressions regarding the Minnesota Multiphasic Personality
Inventory records in these cases are:

Most frequent is “psychoneurosis with compulsive and depressive features.”
Such [persons] tend to be pervasively resistant on an unconscious level to any
externally imposed task. Since childhood, however, they have concealed such
resistances from themselves and others by a facade of hard-workingness,
meticulousness, and earnest dutifulness. In the unstructured environment of a
university, the loss of the continued external pushing of teachers and parents
permits the overthrow of the process of grudging achievement, and the
resistances then manifest themselves in nonperformance.

The academic failure probably has meaning in terms of unconscious satisfaction
of the hostility usually directed towards some member of the family who demands
success, while the excellent scores on tests taken in a counseling situation
may be interpreted also as hostile gestures. Because no importance is attached
to these tests, the counselee is free to do with them as he wishes. It is a
declaration, perhaps, of the lack of significance of his academic failure.

It can be seen that such interpretations involve considerable speculation,
but they alert the counselor to conflicts that may emerge during counseling.

INTERPRETATION OF RESPONSES AS DIAGNOSTIC SIGNS 

The risky assumption that the subject is telling the truth can be avoided if
we interpret his response, not as self-description, but as an act of verbal
behavior that is correlated with his inner nature. These two approaches have
been characterized by Florence Goodenough as the “sample” and “sign”
approaches. In regarding responses as samples of behavior, we use transparent
items and pay primary attention to the content of the responses. We are at
the mercy of the subject who wants to mislead us. If we plan to regard
responses as signs, we can use items whose surface content is irrelevant to
what we wish to measure, and even distorted responses may have diagnostic
value.

The Strong blank, as originally employed, is based on the “sign” principle.
Strong did not include activities in his Engineer key because they are part of
the engineer’s work; he asked only whether the response was characteristic
of engineers. By counting dozens of such “signs,” he distinguishes men who
resemble the typical engineer from those who have little in common with
engineers. The interpretation that a person belongs in a category is made on
a strictly actuarial basis. Strong can say, “Persons with this combination of
responses tend to become engineers” in the same way that an insurance
examiner might say, “People with this combination of weight, blood pressure,
and heart condition rarely live beyond 70.” In strict actuarial
interpretation, the tester makes no pretense of a rational connection between
a particular response and the criterion. Engineers have greater than average
liking for The National Geographic Magazine; prediction can be based on this
fact whether or not any psychological significance can be attached to it.
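
The actuarial logic can be made concrete with a small sketch. It is not
Strong's actual weighting system: the unit weights, the .20 threshold, and
the toy data are assumptions made only for illustration. An item enters the
key whenever the criterion group and the reference group endorse it at
sufficiently different rates, whatever the item appears to be about.

```python
def build_empirical_key(criterion_group, reference_group, min_difference=0.20):
    """Weight each item +1, -1, or 0 from group endorsement rates alone.

    Each group is a list of persons; each person is a list of 0/1 answers
    (1 = "like"). Item content plays no part in the weighting.
    """
    key = []
    for i in range(len(criterion_group[0])):
        p_criterion = sum(p[i] for p in criterion_group) / len(criterion_group)
        p_reference = sum(p[i] for p in reference_group) / len(reference_group)
        difference = p_criterion - p_reference
        if difference >= min_difference:
            key.append(1)       # liked more often by the criterion group
        elif difference <= -min_difference:
            key.append(-1)      # liked less often by the criterion group
        else:
            key.append(0)       # does not discriminate; left unscored
    return key


def score(person, key):
    """Sum the weights of the items the person endorsed."""
    return sum(weight for answer, weight in zip(person, key) if answer == 1)


if __name__ == "__main__":
    # Hypothetical four-item data: rows are persons, columns are items.
    engineers = [[1, 1, 0, 1], [1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1]]
    men_in_general = [[1, 0, 1, 1], [0, 0, 1, 1], [1, 0, 1, 0], [1, 0, 0, 1]]
    key = build_empirical_key(engineers, men_in_general)
    print(key)
    print(score([1, 1, 0, 1], key), score([0, 0, 1, 1], key))
```

The fourth item is endorsed equally often by both groups and so receives
no weight, however relevant its content might seem.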

The actuarial approach eliminates the assumption of honest self-report.
The question “Is your health better or poorer than average for your age?”
does not obtain valid facts about health. One person overrates his health in
reporting; another who has only minor ills exaggerates them. If clinically
diagnosed neurotics reply “poorer” more often than do normals, this answer
may be diagnostic even when it is “untrue”; in fact, it may be diagnostic
just because it is untrue. Empirical scales take the “attitude that the verbal
type of personality inventory is not most fruitfully seen as a ‘self-rating’ or
self-description whose value requires the assumption of accuracy on the part
of the testee in his observations of self. Rather is the response to a test item
taken as an intrinsically interesting segment of verbal behavior, knowledge
regarding which may be of more value than any knowledge of the ‘factual’
material about which the item superficially purports to inquire. Thus if a
hypochondriac says that he has ‘many headaches’ the fact of interest is that he
says this” (Meehl, 1945, p. 9).

The empirically scored test can be used for purposes the subject never
suspects. The Strong ostensibly assesses vocational interests, but one scoring
key combines those items which men answer differently from women into a
“masculinity-femininity” score. It is presumably possible to distinguish
communists from noncommunists, or girls who are likely to marry and stop
working from those who are likely to remain in an occupation. The inventory
could likewise be keyed to distinguish juvenile delinquents from
nondelinquents, or potential suicides from nonsuicides. The basic principle is
that any group which differs on one psychological quality differs on other
qualities. Reports on some of these qualities are likely to be falsified to
gain social approval. The remaining qualities, which carry no connotation of
approval or disapproval, permit valid indirect measurement. Actuarial scoring
is by no means certain to eliminate distortions, as the faking studies on the
Strong blank show. Basing the scoring weights on empirical connections makes
it more difficult for the subject to guess what significance will be attached
to his statement that, for example, he likes to read the Geographic.

Within an empirically scored test, one can distinguish between items that
are “obviously” related to a key and those whose connection is indirect. For
example, in taking the Strong blank a boy trying to fake a high score as
Engineer could be expected to indicate a marked liking for mathematics and
technical subjects; these are “obvious” items. He would be less likely to
realize that interest in The National Geographic is characteristic of
engineers; this item may be classified as “subtle.” If we were to make up two
scoring keys, one for obvious and one for subtle items, we would perhaps be
able to make much more valid distinctions. A person who has a high score on
both keys is more surely like an engineer than one who is high only on the
obvious ones.
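
The two-key proposal might be sketched as follows. The item names, the
assignment of items to keys, and the cutoff of .60 are hypothetical; only
mathematics, technical subjects, and The National Geographic are mentioned
in the text, and nothing here reproduces an actual Strong or MMPI key.

```python
def key_proportion(responses, key_items):
    """Proportion of a key's items that the subject endorsed (1 = endorsed)."""
    return sum(responses.get(item, 0) for item in key_items) / len(key_items)


def compare_keys(responses, obvious_items, subtle_items, cutoff=0.6):
    """Classify a record by whether it is high on each of the two keys."""
    high_obvious = key_proportion(responses, obvious_items) >= cutoff
    high_subtle = key_proportion(responses, subtle_items) >= cutoff
    if high_obvious and high_subtle:
        return "high on both keys"
    if high_obvious:
        return "high on obvious key only"
    if high_subtle:
        return "high on subtle key only"
    return "low on both keys"


if __name__ == "__main__":
    obvious = ["likes_mathematics", "likes_technical_subjects"]
    subtle = ["likes_national_geographic", "hypothetical_subtle_item"]
    faked = {"likes_mathematics": 1, "likes_technical_subjects": 1}
    genuine = dict(faked, likes_national_geographic=1,
                   hypothetical_subtle_item=1)
    print(compare_keys(faked, obvious, subtle))
    print(compare_keys(genuine, obvious, subtle))
```

A record that is high on the obvious key only is the pattern one would
expect from a subject who endorses what he guesses the key contains.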

This suggestion was introduced in connection with the MMPI by D. N.
Wiener (1948; see also Seeman, 1952) and regrettably has received little
attention. Wiener developed separate keys for five of the MMPI clinical
scales. No direct validity studies on these keys were carried out. There is
evidence, however, that the subtle and obvious keys differ in their
susceptibility both to facade effects and to response sets (Table 60).

TABLE 60. Correlations of MMPI Scores with Measures of Facade Effect

                          Correlations with Facade Score for
Scale                     Obvious Items      Subtle Items

Depression                    -.78               .33
Psychopathic deviate          -.85               .27
Paranoia                      -.72               .06
Manic                         -.53               .40
Hysteria                      -.71               .54

Source: Fordyce and Rozynko; see Edwards, 1957, p. 47;
see also Fricke, 1957; Hanley, 1957.

The usefulness of empirical tests depends primarily on the adequacy of
the validation experiments. Sometimes absurd weights are assigned to items,
suggesting that the original validation was based on an inadequate sample.
Allport (1937, p. 329) protests against a scale in which the word association
“green” to the stimulus “grass” is scored +6 as a sign of “loyalty to the gang.”
Loyal boys might have given this response more often in the sample tested
than disloyal boys, but it is implausible that the same result would be found
in further studies. Large and representative samples are crucial for
establishing empirical keys. At best the validity of empirical keys is only
moderate, since they rely on indirect information. Because they use subtle
items they are less readily explained to clients than are tests using content
interpretation.

8. Distinguish in each of these cases whether the investigator is assuming that
self-reports are truthful.

a. The clinical symptoms of condition x are determined by observation. A list
of symptoms (swollen feet, rash, etc.) is prepared. This list is used to
determine how frequent condition x is in several localities. Each subject is
asked to check whatever symptoms he has.

b. There are three general stages in social development in which a child names
as favorites (1) other children without regard to sex, (2) persons of his own
sex exclusively, (3) persons of the opposite sex. The investigators ask a child
to name his favorite playmates as a means of determining his level of
development.

c. A psychologist administers to a group of applicants a checklist in which each
marks the adjectives that describe him. The success of these men is observed,
and a record is made of the characteristics checked by the successful
applicants but not by the others. This checklist is then given to further
applicants, and those who check the same characteristics as the previously
successful men are hired.

9. What use could be made of a scale predicting what girls are likely to marry?

ETHICAL ISSUES IN PERSONALITY TESTING 

Personality testing has flourished in two contexts, one institutional, the
other individual. Valid information about personality would presumably be of
great value to employers, college admissions officers, and others who make
decisions to carry out institutional policies. In fact, personality tests were
first applied to screen potentially neurotic soldiers. Such institutional
testing tries to determine the truth about the individual, whether he wants
that truth known or not. In noninstitutional testing, tests are applied for
the benefit of the person tested. Here also the tester believes that learning
the truth will be valuable but does not feel free to violate the person’s
wishes. The client who comes with an emotional difficulty wants the
psychologist’s assistance, but he may be quite unprepared to pay the price of
unveiling his soul.

Any test is an invasion of privacy for the subject who does not wish to
reveal himself to the psychologist. While this problem may be encountered in
testing knowledge and intelligence of persons who have left school, the
personality test is much more often regarded as a violation of the subject’s
rights. Every man has two personalities: the role he plays in his social
interactions and his “true self.” In a culture where open expression of
emotion is discouraged and a taboo is placed on aggressive feelings, for
example, there is certain to be some discrepancy between these two
personalities. The personality test obtains its most significant information
by probing deeply into feelings and attitudes which the individual normally
conceals. One test purports to assess whether an adolescent boy resents
authority. Another tries to determine whether a mother really loves her child.
A third has a score indicating the strength of sexual needs. These, and
virtually all measures of personality, seek information on areas which the
subject has every reason to regard as private, in normal social intercourse.
He is willing to admit the psychologist into these private areas only if he
sees the relevance of the questions to the attainment of his goals in working
with the psychologist. The psychologist is not “invading privacy” where he is
freely admitted and where he has a genuine need for the information obtained.

Some testers are regarded as “espionage agents” in industry (Otis, 1957).
The newspapers have reported one case of a psychologist who developed for
an industrial client an inventory intended to detect applicants with strong
prounion attitudes, so that the client, by rejecting such men, could keep the
union weak in his plant. As the tester finds increasingly valid ways of
detecting what men feel and think, and as tests are increasingly imposed by
schools, employers, and military services, there will be serious danger of
conflict between the demands of the psychologist’s employers and the rights
of the person tested.

Responses have to be evaluated in terms of conformity to some ideal. The
employer who used tests to detect union supporters dictated the attitude
he wanted employees to have. If it is repugnant to find a powerful figure
dictating what a citizen may say, it is unthinkable that he should have the
power to punish unuttered thoughts. Yet that is what a subtle measure of
attitudes threatens when used for institutional purposes. Defining certain
score patterns as good necessarily makes the test a force toward conformity
and standardization.

The use of personality tests for selection arouses resistance, as the
prevalence of faking indicates. Calls for open rebellion flare up from time
to time in the public press, a notable example being The Organization Man,
the challenging book of essays by William H. Whyte, Jr. (1956), one of the
editors of Fortune. He warns men seeking executive positions that they can
count on favorable recommendations from the psychologist who examines
them only if they display a particular pattern: extrovert, uninterested in
the arts, and acceptant of the status quo. He advises them to fake
“normality”:

. . . Give the most conventional, run-of-the-mill, pedestrian answer
possible. When in doubt about the most beneficial answer to any
question, repeat to yourself:

I loved my father and my mother, but my father a little bit more.
I like things pretty much the way they are.
I never worry much about anything.
I don’t care for books or music much.
I love my wife and children.
I don’t let them get in the way of company work.

Whyte is certainly incorrect in describing this one interpretation as
representing the practice of all industrial psychologists concerned with
executive selection, but regardless of what the psychologist concludes, the
firm to which he reports is likely to prefer the man who has “safe” attitudes.

The Standards of Ethical Behavior for Psychologists (1958) include the
following principles:

• The psychologist in industry, education, and other situations in
which conflicts of interest may arise among varied parties, as between
management and labor, defines for himself the nature and direction of
his loyalties and responsibilities and keeps these parties informed of
these commitments.

• [When serving the individual] the psychologist informs his
prospective client of the important aspects of the potential relationship
that might affect the client’s decision to enter the relationship.

• The psychologist who asks that an individual reveal personal
information in the course of interviewing, testing, or evaluation, or who
allows such information to be divulged to him, does so only after making
certain that the person is aware of the purpose of the interview,
testing, or evaluation and of the ways in which the information may be
used.

No ethical objection can be raised to the use of subtle techniques and even
of misleading instructions when the information so obtained will be used
entirely for research purposes, the subject’s identity being concealed in any
report. Even when the tests are intended solely for research, the tester
should not be a person who has other responsibilities toward the subject
(e.g., his teacher or therapist) except under the conditions described below.

Whether serving an institution or serving an individual client, the tester
should not use indirect and misleading techniques unless the subject clearly
understands that “anything he says may be used against him.” To be sure, an
employer may regard his refusal to submit to tests as grounds for denying
him employment, but this is ethically preferable to obtaining deceitfully
information he does not wish to give.

In a clinical setting, the psychologist can likewise offer a choice, with an
introduction of approximately this character (see also pp. 293-296): “It
might help to solve your problem more rapidly if we collect as much
information as we can. Some of our tests use straightforward questions whose
purpose you will readily understand. Some of our other tests dig more
deeply into the personality. Sometimes they bring to light emotional
conflicts that the person is not even conscious of. Few of us admit, even to
ourselves, the whole truth about our feelings and ideas. I think I can help
you better with the aid of these tests.”

The client may refuse to take disguised tests if he is not ready to trust the
psychologist with full knowledge of his personality. If this is the case, the
information probably could not be used constructively in counseling him. In
counseling it is both advantage and disadvantage that direct, unsubtle tests
are no more than tabulations of statements made by the person about
himself. While they uncover no secrets, they frequently accelerate the
counseling process because they represent things he is ready to discuss with
the counselor.

There remains the question of using personality tests when the tester has
authority over the person tested. The psychologist diagnosing mental
patients, the military psychologist, or the schoolteacher can enforce tests on
his charges. The standards with regard to such practice probably should vary
from institution to institution. In general, it seems that subtle tests may
properly be used if they are valid and relevant in making decisions which
would otherwise rest on less valid information. The tester should avoid
misrepresentation in giving the tests. For example, it is quite improper to
study an individual’s beliefs under the guise of an opinion poll. Test records
made for employee counseling should never be made available to the
employee’s superiors.

10. Would it be proper for a psychologist working in a government intelligence
agency to develop a key for scoring the Strong or MMPI so as to detect
communists among college students?

11. Should a test used in premarital counseling, given separately to both engaged
persons to determine their suitability and probable success as marriage
partners, use direct or subtle questions?

12. Is it ever “an invasion of privacy” to administer an ability test to a prospective
employee?

13. The Minnesota Teacher Attitude Inventory attempts to identify teachers who
have the attitudes that lead to high ratings from principals. Which would be
best in a school staff: uniform attitudes or variety?


Suggested Readings 

Krugman, Morris. Changing methods of appraising personality. Proceedings, 1956
Invitational Testing Conference. Princeton: Educational Testing Service, 1957.
Pp. 48-57.

In an evaluation which stresses the inadequacy of questionnaires and
projective techniques, Krugman’s composite of extracts from test reviews and
his statement of emerging trends are of particular interest.

Longstaff, H. P., & Jurgensen, C. E. Fakability of the Jurgensen Classification
Inventory. J. appl. Psychol., 1953, 37, 86-89.

This describes an illustrative experiment on faking of a forced-choice
inventory under different sets of directions.

McClelland, David C. Roles and role models. Personality. New York: Dryden,
1957. Pp. 289-332.

Describing personality in terms of typical behavior is not completely
satisfactory because behavior varies with the situation. McClelland illustrates
and accounts for such inconsistency in terms of changing social roles.

Meehl, P. E. The dynamics of “structured” personality tests. J. clin. Psychol.,
1945, 1, 296-303. (Reprinted in G. S. Welsh & W. G. Dahlstrom (eds.), Basic
readings on the MMPI in psychology and medicine. Minneapolis: University of
Minnesota Press, 1956. Pp. 5-11.)

Meehl argues that self-report is undependable and uninterpretable if taken
at face value, and defends actuarial keying as the only suitable method of
obtaining useful insight from questionnaires.

Whyte, William H., Jr. The tests of conformity. The organization man. New York:
Doubleday Anchor, 1956. Pp. 201-222.

This is a scathing critique of personality tests as used in executive selection.
Examine also the Appendix, How to cheat on personality tests.
HISTORY OF PERSONALITY INVENTORIES

IN INTRODUCING personality measurement we have spoken of it as an
attempt to assess “typical behavior.” This phrase, which has served our
purposes to this point, echoes the viewpoint of behavioristic psychology, which
is concerned primarily with overt, observable responses. The behavioristic
outlook is somewhat limiting, however, and we can understand personality
assessment better if we recognize that its development has been strongly
influenced by the attitudes of phenomenological psychology. Phenomenological
psychology is concerned with the way the world appears to the individual,
with his so-called private world. Such expressions as self-concept, feelings
of hostility, and attitude toward authority refer to perceptions and
reactions occurring within the individual. Many important psychological
events such as hallucinations and dreams exist only in the person’s
consciousness. It can be argued that almost all crises of adjustment are shaped
more by the individual’s perception of events than by the events themselves.
As a consequence, many psychologists are more concerned with the subjective
reactions of the person than with his outward responses.

The first personality questionnaires were developed in an attempt to study
the inner world of perception and feeling. Sir Francis Galton in the 1880’s
devised the technique when he needed a standard procedure which could
be applied to numerous subjects for his studies of mental imagery. Use of
questionnaires, again for research purposes, was extended later in the
nineteenth century by G. Stanley Hall in his vast studies of adolescent
development. He used information given by large samples of adults to delineate
normal trends in development, being little concerned with single
individuals.

The questionnaire served rather different functions for the two men. In
Galton’s work, self-report was used as the only possible way to obtain
information on events within the respondent’s head. Hall’s self-report was used
to avoid the labor and delay involved in direct observation of behavior.

1. Which of these problems or topics of research falls within behavioristic
psychology, and which within phenomenological psychology?

a. How frequently does mood change in the typical woman?

b. How much does speed of reading decline in the presence of loud,
continuous noise?

c. Do managers and workers describe the company policy on selecting workers
for promotion in the same way?

d. What is this child afraid of?

Adjustment Inventories 

The first inventory primarily concerned with assessing the individual was
the Woodworth Personal Data Sheet. The U.S. Army, at the beginning of
World War I, wanted to detect soldiers likely to break down in combat, but
individual psychiatric interviews were not practicable when recruits were
processed by the thousand. Woodworth made a list of symptoms such as
psychiatrists would touch upon in a screening interview and presented the
list as a questionnaire. This pencil-and-paper version of the interview
presented questions such as a psychiatrist would ask: “Do you daydream
frequently?” “Do you wet your bed?” etc. It differed from the interview only
in that the sensitivity of individual questioning was sacrificed for speed. Men
who reported numerous symptoms were singled out for further examination.
The test was valued because it had appreciable power to detect maladjusted
soldiers in a situation where individual interviewing of every man was
totally out of the question.

The Woodworth scale was a forerunner of a number of “adjustment
inventories,” which consist primarily of lists of problems, symptoms, or grievances
to be checked. These instruments make little claim to subtle description of
personality, often yielding only a single score representing level of
adjustment. Sometimes only one type of symptom is emphasized, as in the Cornell
Medical Index covering psychosomatic complaints. Sometimes the items are
grouped by logical categories, as in the Bell Adjustment Inventory, which
has scores for home, health, social, and emotional adjustment based
(respectively) on items such as: 1

Has either of your parents frequently criticized you unjustly?

Are you subject to eye strain?

Would you feel very self-conscious if you had to volunteer an idea to start a
discussion among a group of people?

Do you get discouraged easily?

1 Items quoted in this chapter come from various tests and are used by permission of
the copyright holders: Bell Adjustment Inventory, copyright 1934, 1938, 1959, Consulting
Psychologists Press; Minnesota Multiphasic Personality Inventory, copyright
1943, University of Minnesota, published by The Psychological Corporation; Thurstone
Temperament Schedule, copyright 1949, Science Research Associates; Minnesota
Personality Scale, copyright 1941, The Psychological Corporation; Minnesota Counseling
Inventory, copyright 1953, University of Minnesota, published by The Psychological
Corporation.

An adjustment inventory consists of items that differentiate subjects known
to be maladjusted from subjects judged normal.

One principal use of such inventories is to identify those who should be
offered counseling. While “problem cases” who cause trouble are easily
recognized, children and adults who are withdrawn and insecure may not
attract the attention of observers. An adjustment inventory brings to light
many of these cases. Simple though the inventory may be, it can play a
valuable role in large guidance programs. Some indication of the demand for
such aids is the fact that one modest inventory reported, after ten years of
distribution with no special advertising, that half a million copies had been
sold.

Adjustment inventories are best regarded as screening instruments which
single out persons who freely check symptoms and self-criticisms. They are
not definitive measures of any clearly defined trait; such information as they
provide is superficial at best.

Trait Descriptions 

During the period from 1920 to 1945, psychologists were largely
behavioristic in outlook and unwilling to base conclusions on the individual’s
introspections. The inventory was thought of as primarily a substitute for
observation of behavior, and the questions placed more emphasis on what the
individual did than upon how he felt or what he thought. The questionnaire
was broadened to describe as many aspects of behavior as possible, and
responses were summarized by giving scores on a number of “traits” or
response patterns. Personality was conceived during this period as a bundle of
habits. The individual was described by the strength of such traits as
friendliness, confidence, persistence, etc. A “strong” trait was one describing a
response which he usually or frequently made.

In early inventories this list of traits or behavior categories to be scored
was arbitrarily chosen. Some traits such as self-confidence came from
common experience, and some such as introversion from personality theories.
Dozens of instruments were produced, each taking items from its
predecessors, adding a few new ones, and scoring them in new combinations. The
best-known instrument of this period was the Bernreuter Personality
Inventory, like the Bell Inventory in form but using more varied questions. It
was scored for Neurotic Tendency (i.e., adjustment), Self-Sufficiency,
Introversion, and Dominance.
A study of this scale by Flanagan marked the introduction of trait scores
defined according to statistical rules. Flanagan adopted the principle that to
deserve separate names, traits must have low correlations. He intercorrelated
the Bernreuter scores of 305 adolescent boys (Table 61) and found
the traits by no means independent. “Introversion,” as there measured, is
little different from “Neurotic Tendency,” since items on social isolation and
daydreaming carry large weight in both scales. Applying factor analysis,
Flanagan found that Confidence and Sociability scores could account for the
information carried by the four original keys, and he developed scoring keys
for these traits. The scores correlate negligibly and thus do represent
independent aspects of the self-report.


TABLE 61. Intercorrelations of Bernreuter Scores for Adolescent Boys

                     Neurotic    Self-
                     Tendency    Sufficiency    Introversion    Dominance

Neurotic Tendency                   -.39            .87           -.69
Self-Sufficiency                                   -.33            .51
Introversion                                                      -.62
Dominance

Source: Flanagan, 1935.
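
Flanagan's principle can be made concrete with the figures of Table 61. The
sketch below is only an illustration, not Flanagan's procedure: it squares each
printed correlation to estimate the proportion of variance two scores share and
lists the pairs that exceed a cutoff. The cutoff of .50 and the function names
are assumptions introduced here.

```python
# Upper-triangle correlations as printed in Table 61 (Flanagan, 1935).
TABLE_61 = {
    ("Neurotic Tendency", "Self-Sufficiency"): -0.39,
    ("Neurotic Tendency", "Introversion"): 0.87,
    ("Neurotic Tendency", "Dominance"): -0.69,
    ("Self-Sufficiency", "Introversion"): -0.33,
    ("Self-Sufficiency", "Dominance"): 0.51,
    ("Introversion", "Dominance"): -0.62,
}


def shared_variance(r):
    """Proportion of variance two scores have in common (r squared)."""
    return r * r


def redundant_pairs(table, threshold=0.50):
    """List score pairs whose shared variance meets the threshold.

    The threshold is a hypothetical cutoff chosen for illustration only.
    """
    hits = [
        (a, b, r)
        for (a, b), r in table.items()
        if shared_variance(r) >= threshold
    ]
    # Strongest overlap first.
    return sorted(hits, key=lambda item: -abs(item[2]))


for a, b, r in redundant_pairs(TABLE_61):
    print(f"{a} / {b}: r = {r:+.2f}, shared variance = {shared_variance(r):.2f}")
```

With this cutoff only the Neurotic Tendency and Introversion pair is flagged,
which agrees with the remark that the two scales measure little that is
different.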

There followed a period when personality theory was wholly subordinated
to a statistical search for “dimensions” which could summarize personality.
Item intercorrelations led Guilford, for example, to suggest that introversion
could be separated into social introversion (S), thinking introversion (T),
depression (D), cycloid tendencies (frequent shifts of mood) (C), and
restraint (R). Accordingly, he developed the Inventory of Factors
S-T-D-C-R. Later he added eight more aspects of personality. The Guilford
scales were not uncorrelated (resembling verbal and numerical reasoning
scores in this respect). Other investigators therefore rearranged them into
scoring patterns which they regarded as more efficient. Thurstone, for
instance, accounted for much of the information in Guilford’s thirteen scores by
seven factors which he renamed reflective, sociable, emotionally stable,
vigorous or masculine, ascendant or dominant, active, and impulsive. This
game is interminable. One psychologist classifies the items finely, the second
puts some of the small bundles together, the third redivides the large
bundles in a new way, and each gives his own names to the factors. Until trait
lists can be tied down to a definite theory or to external criteria, choice can
be made only on aesthetic grounds. There is at present no consensus among
factor analysts as to the number of factors that have been reliably identified,
the best organization of them, or their most appropriate names.

Trait names, we may note in passing, are a source of serious confusion in
the personality field. The meaning of “introvert” is twisted and turned so
that it represents for one author a brooding neurotic, for another anyone
who would rather be a clerk than a carnival barker. “Ascendance” ranges
from spontaneous social responsiveness, in one theory, to inconsiderate and
overbearing behavior in another. The verbal coinage has been so debased
by popular usage and by questionnaire makers that some investigators try
to free themselves by coining completely new terms. R. B. Cattell (1957)
has succeeded in popularizing his word surgency to describe a certain
pattern of energetic behavior, but he will surely encounter considerable
resistance to such new-minted trait names as parmia, premsia, and abcultion
(akin respectively to social extroversion, emotional sensitiveness, resistance
to intellectual culture). In the present Babel of trait names, the only useful
way to discuss personality test data is to speak of “Guilford’s Ascendance
score,” “CPI Dominance score,” or “Thurstone Ascendant score,” according
to the measure used.

2. Does the Woodworth inventory employ the “sign” or the “sample” approach?

3. Some testers have treated Flanagan’s two scoring keys as supplements to the
four supplied by Bernreuter, reporting all six scores to describe an individual.
Discuss the advisability of this practice.

4. Do you think that most introverts are emotionally maladjusted? Does the
correlation of the N and I scales indicate good or poor construct validity?

5. What is a desirable score on a test such as the Bernreuter?

6. What does it mean to say that Henry falls at the 50th percentile in Sociability?
Can his behavior be described in terms of a “habit”?

Criterion-Oriented Tests 

Construction of personality questionnaires according to the “sign”
principle used by Strong has been rare, chiefly because criteria in the personality
field are disputable at best. One obvious point of departure is psychiatric
classification. D. G. Humm developed the Humm-Wadsworth Temperament
Scale with empirical keys to distinguish such groups as manic and paranoid.
The use of this test was restricted to industrial psychologists given special
training in Humm’s method, and little of the research done with the test was
reported.

Essentially the same approach to test construction and many of the same
items were used in the Minnesota Multiphasic Personality Inventory. This
scale, published in 1942, was very rapidly accepted and remains today the
most widely used and most widely investigated of questionnaires. Although
strictly empirical in its original conception, it proved to be relatively
ineffective in allocating patients to diagnostic groups. The test has, however,
grown in prominence because accumulated research and clinical experience
permit the tester to interpret scores. It will be discussed at length below.

7. In developing an empirical scale for college students one might develop a key
consisting of items that distinguish campus leaders from nonleaders. Suggest
other criteria that might mark important personality dimensions.

8. An investigator believes that teachers can be characterized by the trait
“content-centered” vs. “child-centered.” Outline the procedure needed to make a
self-report test by means of criterion keying.

Tests Derived from Personality Theory 

Whereas both the factor analysts and the empiricists developed tests by
blind groping to find what-correlates-with-what, the more recent trend in
personality measurement is to define constructs on the basis of personality
theory and to prepare items specifically to elicit information about those
constructs. This is not wholly new; indeed, the earliest work on introversion
was stimulated by Jung’s personality theory. But that theory had little
influence on the actual tests, beyond suggesting items for trial. Today,
considerable research is going into the Myers-Briggs Inventory, whose items
and scoring keys are explicitly dictated by Jungian theory. Other instruments
which illustrate this trend are the Edwards Schedule (which derives from
the Murray theory of needs), the Taylor Manifest Anxiety Scale (designed
in connection with research on Hull-Spence behavior theory), and the
California F scale for identifying “authoritarian” personalities. The theoretically
oriented instrument often is confined to one single trait. To validate a test as
a measure of even one construct requires extensive and painstaking research,
and it is a brave investigator who tries to advance on more than one
theoretical front at a time.

9. How many scores would you consider necessary to give a complete picture of 
personality? 

10. The following items are taken from various personality inventories. Is any
apparent purpose served by the alterations in form and wording?

Did you ever have a strong desire to run away from home?
Yes No (Bell Adjustment Inventory)

At times I have very much wanted to leave home.
True False Cannot Say (MMPI)

11. Is any apparent purpose served by these changes of form and wording?

Are you at ease in a large group of people?
Yes No (Thurstone Temperament Schedule)

I am a good mixer.
True False (Calif. Personality Inventory)

Do you like to mix with people socially?
Almost always Frequently Occasionally Rarely Almost never (Minn.
Pers. Inventory)


DESCRIPTION OF THE MMPI 

The Minnesota Multiphasic Personality Inventory (MMPI) holds a place
among personality questionnaires comparable to that of the Strong among
interest measures. It was constructed in a similar empirical manner and
was subjected to exceptionally thorough research by its authors. It appeared
at an opportune time, and great reliance was placed upon it during the rapid
wartime and postwar expansion of clinical psychology. It contributed to and
benefited from the postwar interest in clinical research, and as a result has
been studied more adequately than any other personality test. There are 689
titles included in a bibliography covering MMPI research through 1954;
at that time, the number of MMPI studies was 100 per year and the rate was
still increasing (Welsh and Dahlstrom, 1956).

The MMPI was originally constructed by a psychologist, Starke Hathaway,
and a psychiatrist, J. C. McKinley, to aid in diagnosis of clinical
patients. A collection of 550 items was prepared by borrowing from older
inventories and rephrasing diagnostic cues used by psychiatrists. Among the
items to be answered “T,” “F,” or “?” (cannot say) are these:


I believe I am being plotted against.

It takes a lot of argument to convince some people of the truth.

I wish I could be as happy as others seem to be.

I drink an unusually large amount of water every day.


The content of these items is quite diverse. Some report observable behavior,
some report feelings that could not be observed from the outside, and some
express general social attitudes. Some items frankly report symptoms of
abnormal behavior, whereas others appear to have no favorable or
unfavorable connotation.


Scoring Procedures 

Psychiatric Discriminant Keys. The scoring keys were developed with the
intention of identifying patients with respect to such recognized psychiatric
states as hysteria. Patients of each type were compared, item by item, with a
so-called normal group drawn from visitors coming to a large city hospital.
Items which distinguished paranoids from normals were counted in the Pa
(paranoid) key. Paranoid patients tend to say “True” to the first of our
specimen items (“plotted against”) and it is included in the Pa key. The second
item (“argument to convince some people”) seems to imply a paranoid
insistence on one’s own ideas, but it does not differentiate paranoid patients
from normals and is not in the Pa key. Instead, the evidence shows that
responding F to this item is indicative of hysteria.

The contrast between the MMPI “sign” approach and the content-oriented
approach of its predecessors is illustrated by the fact that certain items of
the MMPI are also found in the Guilford homogeneous scales but are
scored in the opposite direction.

For example, to say that most people inwardly dislike putting
themselves out to help others, that most people would tell a lie to get ahead,
are responses scored as paranoid on the Guilford-Martin, whereas
it is found empirically that these verbal reactions are actually
significantly less common among clinically paranoid persons than they are
among people generally. This kind of finding suggests that paranoid
deviates are characterized by a tendency to give two sorts of responses,
one of which is obviously paranoid, the other “obviously” not [Meehl
and Hathaway, 1946].

The original scales developed by the test authors are as follows:

1. Hs: hypochondriasis
2. D: depression
3. Hy: hysteria
   (scales 1, 2, and 3 together form the so-called “neurotic triad”)
4. Pd: psychopathic deviate
5. Mf: masculinity-femininity
6. Pa: paranoia
7. Pt: psychasthenia
8. Sc: schizophrenia
9. Ma: hypomania

Data for the reference group of normals provide a standard-score conversion
so that results can be plotted on a profile sheet as shown in Figure 78.
Primary significance is attached to scores greater than 70 (50 being the average
for the reference group). This cutoff is somewhat arbitrary, and interpreters
examine all peaks whether or not they cross this line.
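
The standard-score conversion can be sketched in a few lines. The text states
only that 50 is the reference-group average and that 70 is the customary
cutoff; the sketch assumes the conventional T-score scale with a standard
deviation of 10, so that 70 lies two standard deviations above the mean. The
reference mean and standard deviation used in the example are invented for
illustration and are not MMPI norms.

```python
def t_score(raw, ref_mean, ref_sd):
    """Convert a raw scale score to a T score (mean 50, SD 10 assumed)."""
    if ref_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    return 50 + 10 * (raw - ref_mean) / ref_sd


def is_elevated(t, cutoff=70):
    """True when the score lies above the customary (somewhat arbitrary) cutoff."""
    return t > cutoff


# Hypothetical reference group: mean raw score 12, standard deviation 4.
for raw in (12, 20, 21):
    t = t_score(raw, ref_mean=12, ref_sd=4)
    print(raw, t, is_elevated(t))
```

Because the cutoff is a convention rather than a sharp boundary, a score at or
just under 70 would still deserve the interpreter's attention.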

Control Keys. The MMPI is provided with several correction or control
keys intended to identify or make allowance for exceptional response
styles. The simplest group of control keys are known as ?, L, and F.
The ? score is the number of times the person replies “Cannot say.”
Excessive evasion of questions of course makes it meaningless to compare the
subject’s responses with the standardization group. Profiles showing high ?
scores are recognized as invalid.

There are some test items so worded that a person who denies having
these symptoms is almost certainly not evaluating himself frankly. One
example is the item “I sometimes put off until tomorrow what I ought to do
today.” The L (lie) score is based on a count of such improbable answers. A high
L score indicates that answers are untrustworthy but need not indicate de¬ 
liberate lying. The L key detects some cases of “faking good,” but it cannot 
be depended upon to detect faking by sophisticated subjects 

The F (false) score consists, like the Kuder verification score, of responses
given extremely rarely. A high F count reveals carelessness,
misunderstanding, or otherwise invalid answers. The F score tends to be high for subjects
who attempt to fake bad records, because rare responses are usually
unfavorable self-descriptions.
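
The reasoning behind the three simple control keys can be summarized as a
small screening routine. This is a sketch under stated assumptions: the text
gives no numerical cutoff for the control keys, so the value of 70 used below
is borrowed from the profile convention purely for illustration, and the
function name and wording of the notes are invented here.

```python
CONTROL_NOTES = {
    "?": "many 'Cannot say' replies; comparison with the norm group is doubtful",
    "L": "improbable denials; answers untrustworthy, possibly faking good",
    "F": "many rare responses; carelessness, misunderstanding, or faking bad",
}


def validity_flags(control_t, cutoff=70):
    """Return a note for each control key whose standard score exceeds the cutoff.

    control_t maps "?", "L", and "F" to standard scores. The cutoff is an
    assumption made for this illustration, not a published rule.
    """
    return [
        f"{key}: {CONTROL_NOTES[key]}"
        for key in ("?", "L", "F")
        if control_t.get(key, 50) > cutoff
    ]


print(validity_flags({"?": 72, "L": 50, "F": 50}))
print(validity_flags({"?": 50, "L": 50, "F": 50}))
```

A flag of this kind only signals that a record deserves caution; as the text
notes for L, a high score need not mean deliberate lying.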

K, the fourth and most important control key, was designed on an
empirical basis. It was found, early in the test development, that some quite
normal individuals earn scores above 70 in Hs, for example, because of what
have been called “plus-getting” attitudes. That is to say, these persons reply
with such complete frankness or self-depreciation that their response
patterns appear abnormal. Among patients, on the other hand, there are a large
number whose scores remain below 70 because of defensive denial of
symptoms. In order to reduce the number of such misses and false positives in
MMPI diagnosis, a key was made to measure defensiveness. The
investigators identified items commonly marked by the clinical cases whose profiles
were less deviant than they should have been. The K key composed of those
items expresses a bland “all is well” façade, e.g.:

I have very few quarrels with members of my family. (T)

Criticism or scolding hurts me terribly. (F)

[Profile chart not reproduced. Scales plotted, left to right: ?, L, F, K,
Hs+.5K, D, Hy, Pd+.4K, Mf, Pa, Pt+1K, Sc+1K, Ma+.2K.]

FIG. 78. MMPI record of a male mental patient. (Data from Shneidman, 1951,
p. 221. Profile form copyright 1948, The Psychological Corporation.
Reproduced by permission.)

Whereas most control scores are used simply to signal untrustworthy
profiles, the K scale is employed in a regression formula to correct the regular
scales for test-taking attitude. Thus the original Hs scale was replaced by
Hs + .5K. These corrected scales became the main keys for the test about
1946. Although our discussion of validity is to come later, we can give here
one example of the effect of the correction. When 200 normals were
compared with 101 hypochondriacal patients, 5 percent of the normals and 62
percent of the patients exceeded an Hs score of 69.8. After correction, a
cutting score which picked off 5 percent of the normals could detect 72 percent
of the patients (McKinley et al., 1948). The “misses” were thus reduced
from 38 to 28 percent. In subsequent studies by other authors, the K
correction has not been found consistently valuable.
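
The arithmetic of the K correction can be sketched as follows. The weights are
the fractions of K shown in the headings of the profile form in Figure 78
(Hs + .5K, Pd + .4K, Pt + 1K, Sc + 1K, Ma + .2K); scales without a heading
weight are left unchanged. The raw scores in the example are invented, and the
helper names are assumptions made for this illustration.

```python
# Fractions of the raw K score added to each scale, per the profile headings.
K_WEIGHTS = {"Hs": 0.5, "Pd": 0.4, "Pt": 1.0, "Sc": 1.0, "Ma": 0.2}


def k_correct(raw_scores, k_raw):
    """Add the weighted K score to each raw scale score that takes a correction."""
    return {
        scale: raw + K_WEIGHTS.get(scale, 0.0) * k_raw
        for scale, raw in raw_scores.items()
    }


def miss_rate(percent_detected):
    """Percentage of patients NOT picked up by the cutting score."""
    return 100 - percent_detected


# Hypothetical raw scores for one record with a raw K of 16.
print(k_correct({"Hs": 10, "D": 20, "Pd": 15}, k_raw=16))

# Figures quoted in the text (McKinley et al., 1948).
print(miss_rate(62), miss_rate(72))
```

The second line of output restates the reduction in misses from 38 to 28
percent when the false-positive rate among normals is held at 5 percent.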

Still a further method of identifying faking is to score separately the
obvious and the subtle items in any key (see pp. 458ff.).

12. A client coming to a social agency has these MMPI scores:

Hs   D    Hy   Pd   Mf   Pa   Pt   Sc   Ma
43   45   50   50   50   68   42   67   69

How would the interpretation be affected if the “control scores” were as
follows:

a. ?, 72; L, 50; K, 50; F, 50.

b. ?, 50; L, 73; K, 50; F, 50.

c. ?, 50; L, 50; K, 72; F, 50.

d. ?, 50; L, 50; K, 35; F, 50.

Descriptive Interpretation of Coded Profiles. Although MMPI scores derived
from psychiatric diagnosis, the diagnostic categories per se play little part in
its interpretation. At some point in the late 1940’s, as Dr. McKinley reached
the point of retirement and Paul E. Meehl became more actively identified
with research on the scale, a new viewpoint began to replace the original
diagnostic emphasis. By 1951 Meehl was ready to say (Meehl, 1951):

These days we are tending to start with the test, sort people on the
basis of it, and then take a good look at the people to see what kind of
people they are. This, of course, is different from the way in which the
test was built, and different from the usual psychiatrist’s notion of a test
where you start with groups of people sorted out on some basis, for
instance, by formal psychiatric diagnosis, and you try to build a test
which will guess or predict or agree with that . . . criterion. The
idea that the primary function of psychometrics is to permit me to
prophesy what the psychiatrist is going to say about somebody is
not a very powerful way of looking at the Multiphasic.

For Meehl’s purpose, any set of reasonably uncorrelated scales would
presumably have been at least as appropriate as the psychiatrically oriented
keys of MMPI. By the time this new viewpoint emerged, so much
experience had been accumulated on the psychiatric scales that they could be
replaced only with great loss. The subsequent work on the test has been an
attempt to work out meaningful interpretations for obsolete scales.

In arriving at a description of personality, it is customary to consider the
salient features of the MMPI profile simultaneously. The high and low
points are listed in terms of the code numbers given above. For example, a
32-6 profile is one in which scales 2 and 3 are exceptionally high with 3 being
highest, and 6 is exceptionally low. (This individual, that is, has high counts
in depressive and hysteric response categories, and is very much unlike the
paranoid.) Some clinicians use extremely elaborate codes, but two- or
three-digit codes are sufficient; if more detail is required, the original profile is
more satisfactory than the code.
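
The coding idea can be illustrated with a deliberately simplified routine. It
is not the coding system used by the MMPI authors: the cutoffs for
“exceptionally high” (70) and “exceptionally low” (40) and the limit of three
digits are assumptions chosen for this sketch, and the scores in the example
are invented. High points are listed first, highest scale leading, followed by
a hyphen and the low points, lowest scale leading.

```python
def simple_code(t_scores, high_cut=70, low_cut=40, max_digits=3):
    """Build a short profile code from scale numbers (1-9) and their T scores.

    The cutoffs and digit limit are illustrative assumptions only.
    """
    highs = sorted(
        (n for n, t in t_scores.items() if t >= high_cut),
        key=lambda n: -t_scores[n],
    )
    lows = sorted(
        (n for n, t in t_scores.items() if t <= low_cut),
        key=lambda n: t_scores[n],
    )
    high_part = "".join(str(n) for n in highs[:max_digits])
    low_part = "".join(str(n) for n in lows[:max_digits])
    return f"{high_part}-{low_part}" if low_part else high_part


# Hypothetical record: scale 3 highest, scale 2 next, scale 6 very low.
profile = {1: 55, 2: 74, 3: 78, 4: 50, 5: 52, 6: 36, 7: 60, 8: 58, 9: 48}
print(simple_code(profile))  # 32-6
```

A record with high points on scales 2 and 3 and no very low scores would be
coded simply as 23 under the same rules.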

When they introduce a numerical code, the MMPI developers attempt to
sidestep some of the consequences of having started originally from
psychiatric diagnoses. The counselor should never tell a client that he has “a
high schizophrenic score.” Such labels confuse even trained psychologists
when the test is applied outside the mental hospital. Thus, although the test
record forms still carry the labels Pd, Sc, etc., Meehl (1951) advises:

If you can, get into the habit of using the code to talk about curves,
instead of talking about the psychiatric category names. It’s worst
to talk about the schizophrenia key, it’s better to talk about the Sc key,
it’s best to talk about code 8. That is, of course, entirely in line with
starting with the test and looking at the people, instead of trying
to guess the diagnosis. When you are working chiefly with relatively
normal individuals, it is still more desirable to avoid the psychiatric
implication. . . . If you talk about the 87’s and the 23’s, then you
can set up relatively fresh associations with the significance of those
numbers.

Psychological significance is given to coded patterns by cumulating
experience with each type. The principal depository of this information is an Atlas
which gives descriptions and case histories for nearly 1000 psychiatric
patients, classified by profile type. To take one example, a 50-year-old man
tested upon admission showed score 2 (D) over 70 with 3 (Hy) next
highest, and with 9 and 6 (Ma and Pa) as low points. The staff diagnosis was
“psychosis, manic-depressive, depressed state.” After a month of treatment
in a hospital, his profile had changed so that 8 and 9 (Sc and Ma) were
above 70 and 2 (D) was lowest, but high L and F made this record suspect.
At this time the staff changed the diagnosis to “paranoid,” because the
symptoms had changed. The third test for this patient came two years later,
upon a readmission. At this time, he had returned to the 23 pattern, with no
very low scores. His diagnosis was again manic-depressive, depressed. The
case history from the Atlas follows (Hathaway and Meehl, 1951, p. 120):

The admission of this patient with severe depression of about two months’
duration was the latest of several such episodes, with seclusiveness, poor memory,
inability to work, and somatic complaints the outstanding characteristics of the
depression. When he was working, he misplaced tools, and he was convinced that
people watching him noticed the poor quality of his work. He complained of
failing memory, he slowed down physically and mentally, he suffered from insomnia,
there was loss of appetite, and in general he lost contact with his surroundings.
Lacking energy, he found it very difficult even to dress himself or go to meals. His
speech was retarded and incoherent at times.

A year before admission there had been a similar attack from which he had
recovered after six electroshock treatments. Until this first attack, his behavior had
always been normal. His intelligence was average. A shy person, not socially
aggressive, withdrawn, and moderately religious, he had always been kind and had
never lost his temper. A premarital dependency on his mother was later transferred
to his wife. His general adjustment to society was adequate although he was
known as a “drifter,” and at best he held only semiskilled jobs. Throughout his life
there had been a history of cyclic mood swings in which he moved from periods
of elation to periods of depression.

On admission he showed rather severe psychomotor retardation. He did not
appear to be delusional. He had no paranoid ideas, nor was he suicidal. His
sensorium and intellect were intact, and he had some insight into the depression.
There was marked apathy, but he was cooperative and expressed the hope that he
might be “well again.” After shock therapy, which brought about rather marked
change, he became more disoriented, careless, and talkative. He showed some
regression, answered questions foolishly, and was occasionally euphoric,
uncooperative, loud, and demanding. With the continuation of the shock treatments
his behavior became more acceptable and five days after the last of the eleven
treatments he was discharged with almost complete remission of his symptoms. At
that time it was felt that he would probably have another depression. Twenty
months later the patient returned to the hospital. Following his first discharge, he
had been euphoric and unstable for about three months, then had begun to slip
into another depression which persisted until the second admission. He was
discharged after seven shock treatments and was to return to the outpatient clinic
for supportive care. The prognosis about a further relapse was very guarded.

In this record we may note several points, the first being the essential
consistency of the two records taken upon admission, two years and eleven
shock treatments apart. The intermediate record, taken while behavior was
disorganized, showed a markedly different pattern, but its truthfulness was
challenged by the control scores. The high D scores in the admission
profiles are consistent with the clinical picture, and the high Hy scores
are consistent with his “withdrawn, kind, dependent” exterior. The
inadequacy of his defenses, however, led the psychiatric staff to classify
him as psychotic rather than neurotic. This illustrates the important point
that even though MMPI scales use psychiatric language, they are descriptions
of personality patterns rather than direct diagnoses.

There is no simple translation from MMPI information into descriptive
terms. The user of the test must build up a repertoire of information from
the Atlas, from other studies scattered through the literature, and from his
clinical experience. For our purposes, the meanings of the scales can be
introduced by Black’s study with an adjective checklist. Each of 200 women
students at Stanford rated other girls residing in her dormitory by checking
adjectives which best described her. The girl also described herself on the
checklist and took the MMPI. The statistical tabulation then showed which
adjectives were applied to girls with high scores on any MMPI scale. Table
62, based on a portion of the results, shows what reputation (i.e., typical
overt behavior) and what “published” self-concept goes with each MMPI
score. The tabulations are based on small groups and are therefore
indicative of general trends rather than of well-established associations.

Studies such as this extend MMPI interpretation to normal personalities.
The various MMPI scales do seem to depict different types of personality. A
high score on 9 (Ma) may not indicate pathological lack of control, but it
does indicate a colorful, dynamic, self-assertive person. Many other scales
pick out recognizable types of overt behavior.

The difference between the self-ratings and ratings by others is striking.
The girls frequently use favorable adjectives to describe characteristics
which others describe in less flattering terms. The high 9’s say, for
example, that they are enterprising and courageous, while others call them
boastful and selfish. The self-description in some cases, indeed, is
diametrically opposed to the reputation. The high 9’s see themselves as
popular and peaceable, but acquaintances apply these adjectives to them
rarely. This strongly reinforces the view that the self-description is a
statement of what the person believes about himself or what he wants others
to believe, and not an adequate report of typical behavior. On the other
hand, the MMPI’s indirect “sign” interpretation of the self-description may
come close to typical behavior by stating what a claim of popularity
“really” means.

Supplementary Keys. Numerous investigators have tried to develop
supplementary keys to identify subgroups of various types. Welsh and
Dahlstrom (1956) mention 100 supplementary keys, including scales for
socioeconomic status, dominance, prejudice, and social introversion. These
scales have not gained wide usage.

Among the many scales derived from the MMPI, one has attained special
prominence in clinical diagnosis and research, and it, oddly, was not intended


TABLE 62. Typical Behavior and Self-Descriptions Associated with MMPI
Scores of College Women

Scale or Pattern               Description by
with High Score           N    Dormitory Mates                 Self-Description

2 Depression              16   Shy, not energetic, not         Shy, moody, not energetic,
                               relaxed, not kind               not relaxed, not decisive

3 Hysteria                25   Many physical complaints,       Trustful, friendly, not
                               flattering, not partial, not    emotional, not boastful
                               clever

13 or 31 Hypochondriasis  9    Many physical complaints,       Affectionate, partial, not
with hysteria                  indecisive, high-strung,        orderly, not conventional
                               seclusive, eccentric, apathetic

4 Psychopathic deviate    26   Incoherent, moody, partial,     Dishonest, lively, clever, not
                               sociable, frivolous, not        adaptable, not friendly,
                               self-controlled                 not practical

5 Masculinity             15   Unrealistic, natural, not       Shiftless, not popular,
                               dreamy, not polished            unemotional, not having wide
                                                               interests

5 low Femininity          68   Worldly, not energetic, not     Self-distrusting,
                               rough, not shy                  self-dissatisfied, sensitive,
                                                               shy, unrealistic

6 Paranoia                24   Shrewd, hard-hearted            Arrogant, shy, naive, sociable

7 Psychasthenia           20   Dependent, kind, quiet, not     Indecisive, soft-hearted,
                               self-centered                   depressed, irritable

9 Hypomania               52   Shows off, boastful, selfish,   Enterprising, jealous,
                               energetic, not loyal, not       courageous, energetic, popular,
                               peaceable, not popular          peaceable, self-confident

Source: After Black; see Welsh and Dahlstrom, 1956, pp. 151-172.


for practical use. Spence and Taylor wished to test the effect of anxiety
upon learning, in an extension of Hull’s theory of drive (Spence, 1958).
They presumed, from previous theory, that persons with marked, admitted
anxiety symptoms had higher levels of drive and thus would more quickly
acquire a conditioned defense reaction. In order to identify extreme groups
by a simple, unsubtle measure of anxiety, Taylor requested experienced
counselors to choose MMPI statements which constituted overt admissions
of anxiety. The items so selected were combined into a short questionnaire
and used for the laboratory studies of learning. When a puff of air onto
the eye was associated with a bright light, eyeblink responses to the light
stimulus were far more numerous among the “nonanxious” subjects (J. Taylor,
1951). The questionnaire was subsequently adopted by many investigators
and clinical counselors, under the name of the Taylor Manifest Anxiety
Scale. The scale has not been standardized, validated, or published in the
usual sense, and it appears to have no special virtues to recommend it for
clinical purposes over other adjustment indicators.


VALIDITY OF INVENTORIES FOR SPECIFIC DECISIONS 
Screening of Deviant Personalities 

Separating Patients from Normals. Although the MMPI was designed with
psychiatric diagnosis as a criterion, the authors early abandoned claims
that the test had great power as a discriminant. Some of the scales had
reasonably satisfactory relations to the criterion, but others were regarded
as questionable even at the time of publication. In papers by the test
authors one finds many remarks like these: “The evidence for the validity of
Ma is certainly not conclusive.” The published version of the Sc scale “was
only slightly better than the ones that were rejected.” “The Pt scale has
never been considered very satisfactory.” And for Pa, “Cross-validation was
always disappointing.” This frankness is in welcome contrast to the glowing
accounts of other test developers who have made less effort to validate
their instruments.

Under favorable conditions, the various scales (except 5, 7, and 8, which
are especially weak) have more or less the same discriminating power. A
cutting score which yields 5 percent false positives among normals will
identify from 62 to 74 percent of the patients in the category to which the
scale corresponds. The precise character of the data is indicated in Figure
79, which shows the distribution on scale 2 (D) of 690 normals and 35
patients who had previously been identified as clinically depressed. The
data show a distinctly higher mean for the patient group, 63 percent of them
falling at or above 73, which is the 95th percentile for normals. It is hard
to say without further analysis whether the screening validity is high
enough to be useful. In order to face this question, we must take into
account the number of true depressive cases in the population likely to be
tested (Meehl and Rosen, 1955).

FIG. 79. Distributions of normal and patient groups on MMPI scale 2 (D),
plotted against standard score. (Starke R. Hathaway and J. C. McKinley, 1942)

As an illustration, let us assume that among the persons coming to a clinic
50 percent are depressed. Then let us change this figure to other base
rates: 20 percent, 5 percent, and 2 percent. Figure 80 plots the probability
that a person with each score will be depressed, using the distributions of
Figure 79 with each base rate in turn. Again, we see the clear relation
between score and probability of being properly called a depressed patient.
The value of the test in diagnosis depends heavily on the base rate. One
will be right 80 percent of the time if classifying a person with a score of
70 on scale 2 as depressive, provided that depressives constitute half of a
clinic’s intake. If, as is more probable, the proportion of depressives is
20 percent, one must shift the cutting score to 83 to have the same
confidence in a borderline diagnosis. Such a shift, however, leaves over
half the depressive group undetected. A poor D score is therefore far from
dependable as an indicator of maladjustment. A good score does permit
confident judgments; with any reasonable base rate, the tester can be sure
that a person scoring 50 or below is very unlikely to be depressive. If
these low scores are passed over while the remaining cases are submitted to
further interviewing or testing, virtually no depressives will be
overlooked.
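
The effect of base rate can be checked with Bayes' rule. The sketch below is
not the computation behind Figure 80. It uses only the two figures reported
for a cutting score of 73 (63 percent of depressed patients and 5 percent of
normals fall at or above it) and asks what proportion of persons at or above
that cut would actually be depressed at each base rate. The function name is
illustrative, and the reported percentages are assumed to hold exactly in the
population tested.

```python
def prob_depressed_given_flagged(base_rate, hit_rate=0.63, false_pos_rate=0.05):
    """Bayes' rule: P(depressed | score at or above the cutting score)."""
    true_pos = hit_rate * base_rate                 # depressed and flagged
    false_pos = false_pos_rate * (1.0 - base_rate)  # normal but flagged
    return true_pos / (true_pos + false_pos)


for base_rate in (0.50, 0.20, 0.05, 0.02):
    p = prob_depressed_given_flagged(base_rate)
    print(f"base rate {base_rate:.2f}: P(depressed | flagged) = {p:.2f}")
```

Under these assumptions the proportion of correct positive calls falls from
roughly nine in ten at a 50 percent base rate to roughly one in five at a
2 percent base rate, although the test itself has not changed.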


FIG. 80. Probability of correct identification of depressives, plotted
against T score on MMPI scale 2 (D), with one curve for each base rate.

Military Screening. Essentially similar results have been found for other
modern questionnaires, though the validity of any test varies from situation
to situation. Short, undisguised questionnaires have been of distinct value
for screening military populations (W. A. Hunt and I. Stevenson, 1946). An
inventory of only twenty items was profitably used by the Navy to determine
which men should be seen by psychiatrists for examination and possible
discharge as unfit. With a cutting score set to allow 5 percent false
positives, 53 percent of the discharges could be identified. A questionnaire
given 2081 Seabees successfully identified 281 cases later adjudged by
psychiatrists to present neuropsychiatric conditions, missed 16 who came to
the psychiatrists’ attention through difficulty on duty, and falsely picked
up 244 men judged normal upon further study (D. H. Harris, 1945). This means
that the tests permitted psychiatrists to omit individual interviews with
1540 men, not a trifling saving.
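
The Seabee counts can be laid out as a fourfold table to see where the figure
of 1540 comes from. This is a minimal sketch using only the four numbers
quoted from Harris; the variable names and the derived rates are illustrative
additions, not statistics reported in the study.

```python
# Counts reported for the Seabee screening (D. H. Harris, 1945).
tested = 2081
hits = 281             # flagged by the questionnaire, confirmed by psychiatrists
misses = 16            # passed by the questionnaire, later found to be cases
false_positives = 244  # flagged, judged normal on further study

# Men who passed the questionnaire and never came to psychiatric attention.
passed_and_normal = tested - hits - misses - false_positives

# Proportion of all known cases that the questionnaire flagged.
proportion_of_cases_caught = hits / (hits + misses)
# Proportion of normal men who were flagged anyway.
false_positive_rate = false_positives / (false_positives + passed_and_normal)
# Proportion of flagged men who turned out to be cases.
flagged_who_were_cases = hits / (hits + false_positives)

print(passed_and_normal)
print(round(proportion_of_cases_caught, 3))
print(round(false_positive_rate, 3))
print(round(flagged_who_were_cases, 3))
```

Roughly half of the men flagged were judged normal on further study, which is
why such a questionnaire serves as a first screen rather than as a diagnosis.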

Screening of Students. Attempts to screen students to find those requiring
counseling have been more disappointing. Over 800 college students were
interviewed repeatedly during the year by counselors who then made a
diagnosis of the kind and extent of maladjustment (Darley, 1937). The Bell
Adjustment Inventory given at the start of the year identified 40 students
truly having problems relating to home adjustment, but missed 41, and
produced 73 false positives. On emotional adjustment, there were 32 hits, 75
misses, and 42 false positives. In a study of the Bernreuter, an
exceptionally good criterion was used: observation records gathered
continually during the year. Of sixteen girls at the maladjusted extreme on
the Neurotic Tendency scale (out of 81 subjects), only six were considered
actually maladjusted, whereas two of those least maladjusted according to
the test were rated maladjusted on the criterion. The Self-Confidence scale
was more successful. Ratings agreed with test scores for all ten girls
showing extremely low confidence on the test, and for six of eight whose
test scores showed high confidence (Feder and Baer, 1941).

Although errors are too frequent to warrant trust in questionnaires as
indicators of maladjustment, scores have validity better than chance.
Stogdill and Thomas (1938) found that the mean score among students
reporting voluntarily for counseling was significantly deviant on the
Flanagan keys. The correlation of Self-Confidence with rated maladjustment
was .59 for men. Overlap in scores between normal and counseling groups was
too great for screening validity. Correlations with ratings were of
negligible size in a group of probationary students referred for assistance;
any such test is more likely to be valid in groups seeking assistance than
in groups who are uncooperative.

Identifying problem cases by self-report methods at earlier ages has
proved to be very difficult. Investigators who compare known delinquents
with normals find some differences in scores, but the scores of the groups
overlap so much as to discourage reliance on the tests for screening. On the
Heston Personal Adjustment Inventory, 30 to 48 percent of delinquents
(compared to 8 to 16 percent of a matched control group) fell below the
20th percentile on Emotional Stability, Confidence, and two other scales
(Hathaway and Monachesi, 1953). This relation is far poorer than the level
of discrimination achieved by psychiatric interviewing.

An exceptional study tested all ninth-graders in Minneapolis and followed
the 4000 cases for two years. Predictive validity was examined by comparing
those who later became delinquents with the remainder. Several significant
differences were found, the main results being indicated in Table 63. Scales
F and 4 were prognostic of delinquency, and codes 2, 5, and 7 were
relatively rare among delinquents. The F scale proved to be the most
indicative of potential delinquency. Scale 4 indicates an acting-out,
impulsive personality, insensitive to social controls.


TABLE 63. Rate of Juvenile Delinquency for Various MMPI Profile Types

                    Percent of       Percent of Boys  Percent of       Percent of Girls
                    All Boys         in Code Class    All Girls        in Code Class
                    Falling in       Who Became       Falling in       Who Became
High Score          Code Class       Delinquent       Code Class       Delinquent

F                   5                48               3                22
4 (Pd)              21               28               19               12
9 (Ma)              21               22               17               8
2 (D)               4                12               1                3
5 (Mf)              5                9                17               5
7 (Pt)              6                19               7                4

Total, all classes  100              22               100              7

Source: Hathaway and Monachesi, 1953, p. 131.
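
One way to read Table 63 with screening in mind is to ask what share of all
eventual delinquents falls in each code class. The sketch below does this for
the boys' columns. It assumes the rounded percentages in the table are exact,
and the names used are illustrative; only the code classes listed in the table
are included, so the shares do not sum to 100.

```python
# Boys' columns of Table 63:
# code class -> (percent of all boys in class, percent of class who became delinquent)
boys = {
    "F": (5, 48),
    "4 (Pd)": (21, 28),
    "9 (Ma)": (21, 22),
    "2 (D)": (4, 12),
    "5 (Mf)": (5, 9),
    "7 (Pt)": (6, 19),
}
overall_rate = 22  # percent of all boys who became delinquent


def share_of_delinquents(pct_in_class, pct_delinquent, overall=overall_rate):
    """Percent of all delinquent boys who come from the given code class."""
    return pct_in_class * pct_delinquent / overall


shares = {code: share_of_delinquents(*row) for code, row in boys.items()}
for code, share in shares.items():
    print(f"{code}: {share:.1f} percent of delinquent boys")
```

On these figures the F class, despite its high delinquency rate, accounts for
only about one delinquent boy in nine, because so few boys fall in it.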


13. What does Table 63 indicate about the practical value of MMPI for screening
potential delinquents?

14. Is concurrent or predictive validity the primary concern in screening studies?

15. What importance can be attached to the finding that Pd scores decrease 
markedly with age? 

Limitations of Validation Studies. The studies above are based on a
consideration of one score at a time. Writers on the MMPI have stressed
judgments based on all scores together. It is well established that a
deviant group has high averages on several scales, not just the one
“appropriate” to its diagnosis, and this is a reasonable finding since a
disorder often involves many types of symptoms. The argument is made,
therefore, that screening should take into account many scores at once by
means of a linear combination, a nonlinear combination, or a multiple cutoff
pattern. (See Chapter 12 for the distinctions between these methods.) While
one can find dozens of papers arguing for such screening methods, it is
difficult to locate evidence about the accuracy of multiscale analysis or
evidence comparing it to single-scale screening. Statistical studies of this
question are few and inadequately reported. One reason, of course, is the
shift in emphasis from actuarial interpretation to descriptive
interpretation shortly after the MMPI scales were put into their final form;
another, the fact that too few cases of any one pattern are found for
adequate statistical summary. Some evidence on the use of MMPI patterns to
distinguish between types of patients is presented below.

Before trying to explain the generally poor performance of inventories as
screening devices, let us emphasize certain aspects of the design of
validation studies. These points are important because some articles report
striking success in predicting various criteria, and in many instances the
apparent success merely results from improper analysis. The first error to
be noted is validation of a key or scoring formula on the same cases used to
select items and establish weights. As was pointed out in Chapter 12,
cross-validation is essential to avoid giving credit for chance
discriminations peculiar to the sample studied.

A second common fault is to demonstrate significant differences (e.g.,
between delinquents and normals) without examining the base rate. The
usefulness of a screening or categorizing instrument depends upon the number
(not the percentage) of misses and false positives at any cutting score.
Given enough cases, highly significant differences can be established for
instruments which have no practical value.

A similar remark is to be made about comparisons of extreme groups.
Gough (1957) shows a very significant difference on the Sa scale of the
California Psychological Inventory between boys nominated by principals as
most and least self-accepting. This comparison is based on 52 boys in each
group, selected from among six high schools. Gough computes a biserial
correlation of .46 for these data. But as can be seen in Figure 20, there
can be a great difference between extreme groups even when the correlation
based on the entire population is very low. A correlation coefficient must
be computed on (or estimated for) the entire population to whom the test
will be applied. If, as Gough’s report seems to indicate, the principals
nominated the extreme 1 percent of their student bodies, a recomputation
indicates that the true validity of Sa is approximately .15, not .46.
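
The size of this inflation can be approximated under a simple model. The
sketch below is not the recomputation referred to above, whose details are
not given. It assumes that test and criterion are bivariate normal with
population correlation rho, that the two nominated groups are exactly the top
and bottom 1 percent on the criterion, and that the groups are equal in size.
The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def extreme_group_biserial(rho, tail=0.01):
    """Biserial r between test score and membership in the extreme groups.

    Assumes a bivariate normal test and criterion with correlation rho,
    with groups formed from the top and bottom `tail` fraction of the
    criterion and equal in size.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - tail)           # criterion cut for the top group
    tail_mean = std_normal.pdf(z) / tail         # mean criterion score in the top group
    tail_var = 1.0 + z * tail_mean - tail_mean ** 2  # criterion variance in that group

    half_gap = rho * tail_mean                   # test mean of each group, in SD units from 0
    within_var = (1.0 - rho ** 2) + rho ** 2 * tail_var
    total_sd = sqrt(within_var + half_gap ** 2)  # SD of test scores in the pooled extremes

    # Biserial formula with p = q = .5; the ordinate is the normal density at z = 0.
    p = q = 0.5
    return (2.0 * half_gap / total_sd) * (p * q) / std_normal.pdf(0.0)


print(round(extreme_group_biserial(0.15), 2))
```

Under these assumptions a population correlation of .15 yields an
extreme-group biserial of about .47, close to the reported .46. Setting
tail=0.5, so that the two groups together make up the whole population,
returns rho itself.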

Explanation of Results. The discouraging results for even the best
available inventories can be explained in two rather different ways. The
defender of the inventory will argue that the evidence is, on the whole,
favorable; the critic will argue that the inventory is inefficient either in
principle or because of poor design. The defender can argue that the
criteria used in validation and in scale construction are themselves
invalid, and, indeed, that a test which predicts diagnosis perfectly would
be far from a true picture of personality. The diagnosis of maladjustment is
controversial at best. Psychiatrists disagree as to what categories should
be used and disagree in their classification of individuals. Clinical staffs
have such marked biases toward the use of certain diagnoses that it has been
said, only half jokingly, that whether a patient is called psychotic or
neurotic depends as much on the hospital he enters as on his symptoms.

Meehl insists that the diagnosis is at best only a starting point. Just as
Binet started with bright pupils selected by teachers, the Minnesota testers
started with diagnosed hypochondriacs, but the intention in both cases was
to develop an instrument which would be superior to the starting criterion,
i.e., which would in the end disagree with it. The Binet scale, they point
out, has value precisely because it detects bright children that teachers
overlook, and corrects the teacher’s overfavorable evaluation of other
cases. While the attack upon diagnoses is legitimate, one must be wary of
any implication that when test and psychiatrist disagree the test is the
more dependable. Evidence to support this type of claim has not been
developed for the MMPI as it has for the Binet scales.

A second pertinent defense is that in many studies the criterion is crudely
determined, even if in principle it could be made dependable. Thus the data
on depressives presented above are probably unfair to the MMPI. The patient
group included some cases who might have recovered from their depressive
phase before testing, and the normal group, including as it did
unhospitalized relatives of patients, may well have included numerous
undetected depressives. More generally, nonpatient status is no guarantee of
sound personality; many persons in the community have serious maladjustment
which remains undetected only so long as they are exposed to no exceptional
stress. The fact that the pressures of life are not the same for all persons
greatly reduces the prospect of predicting behavior from personality
measures.

It remains a question whether some other investigator could produce a
better screening instrument than the MMPI. There are many reasons for
thinking that it is far from the most efficient actuarial instrument that
could be developed. In the original derivation of scales, the number of
cases of each patient group was generally below fifty, and often below
thirty; as a consequence, chance may have played a large part in assigning
items to scales. Moreover, the patients and normals were quite differently
motivated in taking the test, and neither had the motivation likely to be
encountered when the test is used for screening. The scales have lower
stability than desirable, the median for normals over one week being .80
(Cottle, 1950a). Items were combined with little regard for their
intercorrelations, yet in actuarial prediction it is profitable to maximize
item heterogeneity so as to raise the correlation of the scale with the
criterion. Separate scoring of subtle and obvious items would probably be of
value; despite the empirical origin of MMPI keys, there is evidence (McCall,
1958) that the obvious face-valid items carry almost all of the
discriminating power. Finally, although it is evident that combining several
scales can improve differentiation, no formula for combining the scales to
separate normals from patients has been systematically validated. Even with
improved test construction, the variation from one population to another
(e.g., between a community clinic in a small town and a city hospital) may
be so great that even the most powerful actuarial test will have very
limited general validity.
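
The remark about item heterogeneity follows from the standard formula for the
correlation of a sum with an outside variable. The sketch below assumes
standardized items that are simply added, all with the same correlation with
the criterion and the same correlation with one another. The function name
and the numerical values are hypothetical illustrations, not MMPI statistics.

```python
from math import sqrt


def composite_validity(n_items, item_validity, mean_intercorrelation):
    """Correlation of an unweighted sum of standardized items with a criterion.

    item_validity: correlation of each item with the criterion.
    mean_intercorrelation: correlation between each pair of items.
    """
    k = n_items
    # Variance of the sum is k + k(k - 1) * r_ii; its covariance with the
    # criterion is k * r_ic.
    return k * item_validity / sqrt(k + k * (k - 1) * mean_intercorrelation)


# Twenty items, each correlating .20 with the criterion.
for r_ii in (0.05, 0.10, 0.30):
    print(r_ii, round(composite_validity(20, 0.20, r_ii), 2))
```

With item validity held fixed, the scale's correlation with the criterion
rises as the items correlate less with one another, which is the sense in
which heterogeneous items are profitable for actuarial prediction.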

We may summarize the findings on personality tests as screening
instruments as follows: In dealing with large populations (military
recruits, college students, etc.) where individual attention cannot be given
to everyone, questionnaires validated on that type of population are of
great value as a preliminary screen. Persons with better scores can be
passed over while more systematic diagnostic procedures are applied to the
remainder. The number of deviant cases missed is reduced if suitable methods
to control faking are applied. It is never proper to assume that those
earning poor scores on a questionnaire are seriously maladjusted; the number
of false positives makes it imperative to regard the test as only a first
stage of investigation.

Differential Diagnosis of Patients 

The second original aim of the MMPI was to distinguish one type of
psychopathology from another. Clinical psychologists are commonly required,
especially in mental hospitals and outpatient clinics, to make a rapid
decision as to the probable nature of the patient’s disorder.

MMPI profiles for various diagnostic groups differ significantly, and
experienced MMPI users can classify profiles with some success. Guthrie
(1950) asked them to classify 89 records into six piles (paranoid, anxiety
state, depressed, etc.), and found that accuracy ranged from 36 to 54
correct placements for various judges. Though substantially better than
chance, this represents rather low accuracy for diagnostic purposes.
Sullivan and Welsh (1952) found a set of eight “pattern” characteristics
(e.g., score 1 higher than 3) which, considered simultaneously,
differentiated ulcer patients from unselected neuropsychiatric patients with
some success (70 percent of ulcers correctly identified at a cost of 37
percent false positives). One is forced to conclude that analysis of MMPI
scores, whether impressionistic or actuarial, is at best a source of
hypotheses about diagnosis to be checked by other methods. In this role, it
can be of definite assistance in the clinic.

Results on differential diagnosis with questionnaires other than MMPI
have in general been unencouraging, and in recent years the MMPI has
displaced all competing questionnaires for this purpose.

16. SVIB keys for differentiating medical specialist groups from each other had 
little correlation with the key for separating physicians from men in general. 
What does this imply regarding the design of keys for distinguishing one type 
of patient from another? 

Prediction of Vocational Criteria 

Inventories have had rather little success in predicting employee
performance. Ghiselli and Barthol (1953) found 113 correlations between job
proficiency and presumably relevant inventory scores. Nearly all
correlations were positive. The average correlation was as high as .36 for
sales personnel but only .14 to .18 for supervisors and foremen. There was a
wide range among coefficients for the same occupation. The Ghiselli-Barthol
averages are probably unrealistically high. Since investigators file and
forget hundreds of studies with small samples which showed unpromising
relations, only a biased selection reaches publication.

The experience of Household Finance Corporation is consistent with these
statistics. Wonderlic and Hovland (Moore, 1941, p. 60) report: “Our early
investigations were carried on with published tests which have been
standardized by others. We were unable to find any test in which the total
score was significantly prognostic of success in our organization to warrant
its inclusion as part of a selection program. In the cases of many of the
purchasable personality tests, results were obtained which ran counter to
expectations. Clerical workers seemed to be more aggressive than salesmen,
salesclerks were higher than managers.”

Inventories must inquire about typical behavior rather than behavior under
specific conditions. Regardless of what a person is prone to do when given
free choice, he adapts himself to the demands of different situations. He
can be assertive as a parent, submissive in reporting to his commanding
officer, boisterous at a party, decorous in church. People vary in their
ability to assume roles, but there is no evidence that one can assume
convincingly only the roles which match his typical behavior. In this sense,
personality is like posture; the young man who slouches habitually can be
placed in uniform and trained to hold as rigid a military bearing as anyone
else. Personality, as commonly measured, probably has much to do with the
sort of work and personal relations a person seeks, but has little to do
with his ability to perform a role when thrust into it. The adjusted person
is able to adapt his style to role demands.

Dependable use of personality inventories for guidance requires empirical
study of men who succeed and remain in particular occupations. The most
adequate work of this kind has been done with the Kuder Preference
Record-Personal. Figure 81 shows some of Kuder’s evidence that occupational
profiles are often quite distinctive; when more evidence is accumulated,
occupational interpretations of profiles may be of distinct value in
counseling.

FIG. 81. Occupational differences in preference profiles. Mean scores for
men satisfied in each occupation are shown. In the columns where an
occupation is not mentioned, the average score was not significantly away
from the general average of 50. (Data from the manual for the Kuder
Preference Record-Personal)

Inventories may have some place in employee selection. The average
correlation of .36 for salesmen found by Ghiselli and Barthol, and the
several high correlations for small samples of clerical workers, should not
be dismissed. Combining personality and aptitude data might give excellent
prediction. In view of the variation from situation to situation, however,
one can predict success or failure only in a definite job in a specific
firm. Each agency must develop its own prediction formula. Where a large
number of employees in an organization are doing similar work, best
prediction can be obtained by developing a new key for the inventory using
responses characteristic of successful employees.

The most successful device of this nature is the Aptitude Index designed
for use in life insurance agencies. It is a combined personal history and
personality questionnaire, scored by an empirical formula developed after
tryout on a nation-wide sample of agents. The correlation with sales volume
is .40. Although individual prediction is inaccurate, selection by means of
the Index leads to considerable improvement of average sales per man in the
long run. Ignoring those men who quit in their first year, it was found that
agents rated A on the Index produced 206 percent as much business as the
average agent, while those rated E produced only 41 percent of the average
(Kurtz, 1941). In general, for prediction of success biographical
inventories seem to be more satisfactory than questions about personality.

17. Should inventories be used to advise students about their probable vocational 
success? 

18. What advantages does a biographical inventory have over a personality
questionnaire in employee selection?

19. What differences among clerical jobs might account for variation in the
validity of personality tests?

20. How might lack of self-confidence help one student to attain high marks, yet 
be a drawback to another? 


VALIDITY OF INVENTORIES FOR TRAIT DESCRIPTION 
The Test as a Mirror for the Counselee 

In counseling, the personality inventory is used like the interest inventory
to help the individual examine his own characteristics as in a mirror. He
knows what he has said, but the test permits him to compare himself with
others. His percentile standing in various traits is an appropriate initial
topic in counseling. For this purpose, it is probably not wise to use subtle
scales or scales whose meaning is difficult to communicate, since the
instrument serves primarily to reflect the counselee’s own professed
attitudes. To show a counselee his MMPI profile could lead only to
difficulties in explaining the meaning of the categories, and possibly to
his rejection of interpretations based on subtle items.

What inventory is preferred will depend upon the nature of the counseling.
General adjustment inventories or other single-score instruments are of
little use in counseling since they pose few questions for discussion. A
descriptive scale reporting introversion, impulsiveness, and so on is of
potential value in vocational guidance and may open discussion of traits
which the client regards as faults. Descriptions in terms of preferred
activities (e.g., Kuder Preference Record-Personal) and values
(Allport-Vernon) are somewhat better suited to vocational guidance than
scales which describe emotional reactions. The Mooney Problem Checklist is
of considerable value because it draws attention to specific concerns the
client is ready to talk about and wants help with. It is, in effect, a
preliminary interview rather than a measuring device.

A descriptive inventory useful in initiating counseling of college students
is that of Edwards. The profile describes fifteen “needs” which presumably
direct the subject’s actions. Some of the needs, and items related to them,
are:

Abasement—to accept blame when things do not go right

Achievement—to be a recognized authority

Affiliation—to be loyal to friends

Aggression—to attack contrary points of view

Autonomy—to be independent of others in making decisions

The items are paired and the subject chooses the goal or behavior he prefers
in each pair. The interpretation of the scales clings very close to the
explicit content of the items, which aids communication, and yet the summary
in terms of needs may add to the subject’s insight into himself. The
counselor can help him examine how his major needs are currently being
frustrated, how well his future plans will satisfy these needs, or how
factors in his earlier development caused certain needs to develop.

Where a test is used as a reflection of the client’s remarks, several
questions related to validity are of interest. Only partial answers can be
given here, but further information on particular tests being considered as
counseling aids should be obtained from the test manuals.

• Are the scores adequate measures of the published self-concept? Would
another set of items give the same profile? This is to be answered by
parallel-form or internal-consistency reliabilities, or by correlations
between inventories having similar scales. The better inventories show
reliabilities of .80 and above, which is sufficient to pick out salient
characteristics. Scores with the same name in different inventories may have
low correlations, which emphasizes the need for cautious interpretation.

• Do scores reflect lasting characteristics? E. L. Kelly administered several
questionnaires to 300 engaged couples during the years 1935-1938 and
retested nearly all the subjects again in 1954. Among the instruments used
were the SVIB, the Bernreuter, the Allport-Vernon, and the Remmers
generalized attitude scales. The stability coefficients in Figure 82 show a
striking degree of similarity between self-descriptions given twenty years
apart. The interest scores are most stable, but when we allow for the initial
unreliability of the Allport scale it appears that values are equally stable.
Personality scores are only slightly poorer. Attitudes, on the other hand,
are quite temporary. While the self-concept seems to remain relatively
stable, the meanings attached to the rest of the world change greatly with
experience.
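
The comparison of a long-term stability coefficient with a short-term
reliability can be made concrete with a small calculation. The sketch below
is not taken from the text. It assumes the classical model, in which the
short-term reliability is the share of score variance that is not random
error and the long-term stability coefficient is the share that is lasting;
it also assumes the scale is equally reliable on both occasions, so that
"allowing for unreliability" amounts to dividing stability by reliability.
The function name and the coefficients are hypothetical placeholders, not
Kelly's values.

```python
def partition_variance(short_term_reliability, long_term_stability):
    """Split observed-score variance into error, unstable, and stable shares.

    Assumes the classical model: random errors are uncorrelated across
    occasions and the score variance is the same on both occasions.
    """
    if not 0.0 <= long_term_stability <= short_term_reliability <= 1.0:
        raise ValueError("expected 0 <= stability <= reliability <= 1")
    if short_term_reliability == 0.0:
        raise ValueError("reliability must be positive to correct stability")
    return {
        # Variance that a retest a week later would not reproduce.
        "error": 1.0 - short_term_reliability,
        # Genuine at the time of testing, but gone by the later retest.
        "unstable": short_term_reliability - long_term_stability,
        # Variance shared by the two testings years apart.
        "stable": long_term_stability,
        # Stability after allowing for unreliability (assumed equal
        # reliability on both occasions).
        "stability_corrected": long_term_stability / short_term_reliability,
    }


# Hypothetical coefficients for two scales (placeholders, not Kelly's data).
scales = {"scale A": (0.90, 0.50), "scale B": (0.70, 0.45)}
for name, (reliability, stability) in scales.items():
    shares = partition_variance(reliability, stability)
    print(name, {key: round(value, 2) for key, value in shares.items()})
```

With these made-up numbers the less reliable scale B has the lower raw
stability (.45 against .50) but the higher corrected stability (about .64
against about .56), which is the kind of reversal the remark about the
Allport scale describes.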

FIG. 82. Stability of various aspects of personality (Kelly, 1955). The dot
indicates the reported reliability of the scale over a short time interval
(usually one week).

We may also present evidence on stability arising from a study of children’s
personalities. These data are not from questionnaires or self-ratings.
Trained interviewers asked mothers to tell the extent to which their children
showed such problems as insufficient appetite, nailbiting, and
quarrelsomeness. The reports were coded on a five-point rating scale for each
problem, and a total score indicating severity of problem behavior was
derived. The procedure was repeated each year from infancy to age 14. The
correlations in Table 64 should be compared with the correlations for
repeated mental tests in Table 21 (p. 176), which come from the same
longitudinal study. In both tables, we find that correlations from year to
year are much higher at later ages, after development is more stabilized. The
stability coefficients for personality scores are lower than those for mental
tests, but the drop is surprisingly small. While we do not have information
on the stability of personality questionnaire scores for children, these data
show beyond a doubt that problem behavior itself has a great deal of
stability over at least a three-year span.


TABLE 64. Correlation of Problem Behavior Score of Boys with Score at a
Later Age

Approximate Age      Years Elapsed Between First and Second Score
at First Rating          1          3          6         12

      1¾                .38        .40        .27       -.01
      3                 .50        .31        .35        .47
      4                 .56        .57        .54         —
      6                 .67        .70        .81         —
      7                 .73        .55        .51         —
      9                 .70        .75         —          —
     11                 .86        .80

Source: Macfarlane et al., 1954

• Do the descriptions agree with external evidence of typical behavior?
Research comparing self-descriptions with objective records of behavior is
lacking. There have been numerous comparisons of scores with judgments.
The study reported in Table 65 shows that children’s self-reports have a
quite small degree of correspondence with the way their teachers and peers
rate them (see also Powell, 1948). With college students, somewhat higher
validities have been reported. Gordon (1953), for example, found correlations
of .47 to .73 between scores on his inventory and ratings given students
by dormitory mates. These are concurrent, not predictive, validities. Similar
correlations for the EPPS were obtained in an unpublished study by Tamkin
and Klett. (Cf. also Table 62.)


TABLE 65. Correlations Between Three Types of Adjustment Measures for
Ninth-Graders

                                    Teacher Judgments    Peer           Self-Reports
                                              Forced     Judg-     Self-     Social    Basic
                                    Rating    Choice     ments     Adjust-   Adjust-   Diffi-
                                                                   ment      ment      culty
Teacher judgments
  Rating of adjustment                         .77       .56       .30       .33       .22
  Score on forced-choice
    descriptive questionnaire        .77                 .56       .28       .29       .15
Peer judgments
  Nominations on desirable traits    .56       .56                 .28       .28       .16
Self-reports
  California Test of Personality,
    Self-Adjustment                  .30       .28       .28                 .73       .61
  California Test of Personality,
    Social Adjustment                .33       .29       .28       .73                 .47
  SRA Youth Inventory,
    Basic Difficulty score           .22       .15       .16       .61       .47

Source: Ullmann, 1952

• Do the descriptions agree with the true self-concept? A criterion could
be obtained by asking a therapist who is well acquainted with the person’s
inner attitudes to describe him. Evidence of this sort, however, is rare.
Rogers cites correlations ranging from .38 to .48 between scores on his
adjustment inventory and ratings by clinicians, but this is an isolated
investigation, not adequately reported (1931).

21. Applying the method of p. 138 to Figure 82, about what proportion of the
variance in self-confidence is due to random error of measurement, what
proportion to genuine but unstable characteristics, and what proportion to
stable characteristics?

22. In counseling with the Edwards inventory, should raw scores (ranging 0 to 15
on each scale) or percentiles be used to plot the profile?


Descriptions to Aid Institutional Decisions 

The second major descriptive use of inventories is to provide others with
insight regarding the individual. This may be important in clinical
diagnosis or other institutional decisions, and in prescriptive counseling.
The therapist may wish to know as much as he can about the person’s conflicts
long before they emerge in interviews. The college counselor (see p. 456) may
wish to know what hidden attitudes are preventing a student from doing his
best work. For these purposes, subtle tests may have distinct advantages in
the hands of highly experienced testers.

The MMPI is the generally preferred instrument for clinical patients and
for many counseling uses. The interpreter must bring to bear information
from the Atlas and other sources in order to translate scores into
psychological constructs. An illustrative description is that given by
Grayson (Shneidman, 1951, pp. 268-269) for a 25-year-old veteran (see
Figure 78):

This profile may, with a good deal of confidence, be considered as a valid
representation (F within limits) of a seriously disturbed patient (unusually
elevated profile), who has a tremendous amount of anxiety and depression (D),
with sad, worrisome feelings of inadequacy (high D, Pt). The patient is
self-depreciative (low L, low K) and lacking in self-assurance to the point
of being a “compulsive doubter” (high ?). He has attempted to resolve his
anxiety through hysterical displacement (high Hs) and obsessive-compulsive
mechanisms (high Pt) but these have been unsuccessful, with the result that
the overwhelming anxiety has cracked his weak ego structure (high Sc). In
addition, the patient has a considerable degree of hostility directed both
against self (high Pd) and others (high Pa). The strong feelings of anxiety
and depression (D) combined with introverted and extraverted aggression
(Pd, Pa), in an individual who possesses insufficient ego control (Sc) to
inhibit his tendency to act out impulses (Ma) add up to an explosive picture
which presents strong possibilities of suicidal and homicidal behavior.
Diagnostically, the patient may be classified as incipient paranoid
schizophrenia.

The validity of the description is attested by the full case history and
therapy protocol. The following statements are made in a summary by the
man’s psychotherapist: “He seemed suspicious, indecisive and unable to
relax. . . . There seems to be considerable guilt in relation to his own
hostility. He has established some defenses against this through obsessions
but these defenses are cracking and he fears that his hostile impulses might
become so great that he would be unable to control them. The patient seemed
obsessed with thoughts about death, homicide, and suicide.”

The MMPI is not entirely suitable for normal groups, particularly younger
ones. Some of the items (“My sex life is satisfactory”) arouse criticism from
teachers and parents, and the scales of clinical origin produce information
principally on undesirable traits. The California Psychological Inventory
(CPI) and Minnesota Counseling Inventory (MCI) are descendants of the
MMPI specifically designed for relatively normal high-school and college
students. The MCI keys are labeled Family Relationships, Emotional
Stability, Conformity, Adjustment to Reality, Mood, and Leadership. Some of
these keys correspond rather closely to the MMPI clinical scales in general
purpose (e.g., Conformity is a substitute for Pd), but the MCI labels carry
less damaging connotations.

The validity of descriptive interpretations is difficult to assess, especially
when the construct employed cannot be equated with any one observable
behavior. MMPI scales have been given meaning by integrating evidence
from all manner of studies, gradually formulating a psychological hypothesis
about the meaning of each score. Meehl’s remarks on the Pd scale illustrate
the process (Cronbach and Meehl, 1955):²


The Pd scale of MMPI was originally designed and cross-validated
upon hospitalized patients diagnosed “Psychopathic personality, asocial
and amoral type.” Further research shows the scale to have a limited
degree of predictive and concurrent validity for “delinquency” more
broadly defined. Several studies show associations between Pd and
very special “criterion” groups which it would be ludicrous to identify
as “the criterion” in the traditional sense. If one lists these
heterogeneous groups and tries to characterize them intensionally, he faces
enormous conceptual difficulties. For example, a recent survey of hunting
accidents in Minnesota showed that hunters who had “carelessly” shot
someone were significantly elevated on Pd when compared with other
hunters. . . . The finding seems to lend some slight support to the
construct validity of the Pd scale. But of course it would be nonsense to
define the Pd component “operationally” in terms of, say, accident
proneness. We might try to subsume the original phenotype and the
hunting-accident proneness under some broader category, such as
“Disposition to violate society’s rules, whether legal, moral, or just
sensible.” But now we . . . are using a rather vague and wide-range
class. . . .

We want the class specification to cover a group trend that
(nondelinquent) high school students judged by their peer group as least
“responsible” score over a full sigma higher on Pd than those judged most
“responsible” . . . Again, any clinician familiar with MMPI lore would
predict an elevated Pd on a sample of (nondelinquent) professional
actors. Chyatte’s confirmation of this prediction tends to support both
(a) the theory sketch of “what the Pd factor is, psychologically”; and
(b) the claim of the Pd scale to construct validity for this hypothetical
factor. Let the reader try his hand at writing a brief phenotypic criterion
specification that will cover both trigger-happy hunters and Broadway
actors! And if he should be ingenious enough to achieve this, does his
definition also encompass Hovey’s report that high Pd predicts the
judgments “not shy” and “unafraid of mental patients” made upon nurses by
their supervisors? And then we have Gough’s report that low Pd is
associated with ratings as “good-natured,” and Roessell’s data showing
that high Pd is predictive of “dropping out of high school.” The point is
that all seven of these “criterion” dispositions would be readily guessed
by any clinician having even superficial familiarity with MMPI
interpretation, but to mediate these inferences explicitly requires quite a
few hypotheses about dynamics, constituting an admittedly sketchy (but far
from vacuous) network defining the genotype psychopathic deviate.


² References for the studies described are given in the original.

This body of evidence leaves little doubt that Pd has some relation to
internal personality structure. The correlations cited do not represent
strong relations; if they did, we would find the same person dropping out of
school, rated ill-natured, becoming a Broadway actor, and shooting a fellow
hunter. Circumstances dictate much of behavior. Personality structure, even
if perfectly measured, represents only a predisposition rather than an
absolutely determining force.

Granting that Pd and other scores have some validity, we are still uncertain
as to the closeness of correspondence between the scores and the true,
hidden personality structure. Before interpretations can be used with
confidence, we require evidence as to how often we go wrong in assuming that
a person with high Pd has this vaguely defined pattern of arrogant, unruly,
irresponsible attitudes.

The facts required to assess the adequacy of descriptions are seriously
incomplete, and many of the findings strike a pessimistic note. Gough (1957)
correlated a number of CPI scores with ratings of students made by a staff
of psychological assessors. These ratings are based on comprehensive
psychological study and provide a reasonable criterion to test the statement
that persons with certain scores tend to be seen in certain ways. The
correlations between CPI scores and the ratings to which they supposedly
relate range from .21 to .48. Such modest correlations warn against depending
on any single aspect of the description from the CPI.
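
One common way to read correlations of this size, offered here as an
illustration rather than as part of the original argument, is to square them:
under a linear relation, the square of a correlation is the proportion of
variance in the ratings that is associated with the score. The helper name
below is hypothetical.

```python
def shared_variance(r):
    """Proportion of criterion variance associated with the score (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# The two ends of the range reported for the CPI scales.
for r in (0.21, 0.48):
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.2f}")
```

On this reading the single-scale validities correspond to roughly 4 to 23
percent of the variance in the assessors' ratings.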

We may not, however, judge a description of a whole personality by the
validities of the scales taken separately. The personality description covers
many dimensions, and a little information about each feature may add up
to a revealing portrait. Moreover, considering the whole pattern of scores at
once possibly permits much more accurate description than the single-scale
interpretations for which Gough gives validity coefficients. Gough argues
that the interpretation of one score depends upon the level of another in the
manner shown in Figure 83. The interpretation might be further modified if
other scales were taken into account.

Clearly needed, at this point, is evidence not now available. Using single
scales, pairs of scales, or whole profiles, the interpreter should divide cases
into three piles according to whether he regards them as strikingly high,
strikingly low, or in the middle range on some trait (e.g., compliant). The
classification could be correlated with ratings by others who know the
person well.

From the evidence now available (see also p. 592) we must continue to
regard descriptive interpretations as hazardous. Those familiar with a
particular test often believe that it gives them clear pictures of
personality. This may be a self-delusion nurtured by recall of successful
cases, but one cannot deny that many inventories measure individual
differences reliably and that those differences have some relation to
personality as observed in other ways. When the description from the test is
a point of departure for further study of the individual, errors of
interpretation can be corrected. Under no circumstances should such a
description be passed on to a school principal, an employer, or any other
decision maker not trained to check the interpretation critically against
other evidence.

FIG. 83. Proposed pattern interpretation of CPI scores labeled Achievement
through conformity (Ac) and Achievement through independence (Ai) (Gough,
1957).
Ai low: compliant, industrious, moderate, quiet | awkward, coarse,
self-defensive, shallow

23. Grayson diagnosed his case as “incipient paranoid schizophrenia.” What
diagnosis would be suggested from the peak MMPI scores if no effort had been
made to interpret the dynamics of the personality as a whole?

Establishment of Scientific Laws 

A recent development is the employment of personality measures in the
establishment of psychological theory. Test scores are interpreted in terms of
theoretical concepts and related to behavior under various experimental
conditions. The outcome of such experiments is, first, interpretation of the
test in terms of a refined concept rather than an ambiguous or arbitrary
trait name and, second, development of theory as to the significance of the
trait. In addition to the study of Taylor scores and eyelid conditioning
summarized earlier, we may mention the more elaborate study by Cervin (1957)
testing the hypothesis that in a two-person discussion the more emotionally
responsive subject will take more initiative, participate more, and change
opinion less. In order to bring out this effect, he found it necessary to use a
specially purified measure of emotional responsiveness (akin to anxiety or
neuroticism) rather than a published questionnaire. In experimental groups
formed by pairing high scorers with low scorers, the predicted differences
were found in about 80 percent of the cases.

Relations of this type can be considered well established only when
confirmed by other investigators. As more such relations are verified and
woven into psychological theory, tests will come to have an important role in
the science of psychology as well as in practical decisions. Theoretical
clarification should also have practical consequences, though these may be
far in the future.

REPRESENTATIVE PERSONALITY INVENTORIES 

The following inventories illustrate the variety among currently published
inventories, but by no means exhaust the field.

• Billett-Starr Youth Problems Inventory; Roy O. Billett and Irving S.
Starr; World Book, 1958. Grades 7-9, 10-12. A problem checklist covering
such areas as health, boy-girl relationships, personal finance, and planning
for the future. Designed for general screening of pupils for individual study,
and for identification of common problems to be taken up in group guidance.

• California Psychological Inventory; Harrison G. Gough; Consulting
Psychologists Press, 1957. High school. A lengthy inventory covering fifteen
traits such as sociability, tolerance, and intellectual efficiency, plus three
control keys. The scoring keys were developed empirically but have rather low
correlations with their criteria. Interpretation is based primarily on an
impressionistic psychological integration of the entire profile. The profile
covers personality more broadly than most other inventories, but scores often
intercorrelate too highly for efficient measurement. Interpretation has not
yet been adequately standardized and validated.

• California Test of Personality; Louis P. Thorpe, Willis W. Clark, and
Ernest W. Tiegs; California Test Bureau, 1942, 1953. Primary, elementary,
secondary, and adult forms. A questionnaire yielding percentile scores on
personal adjustment and social adjustment. Such subscores as “sense of
personal worth,” “nervous symptoms,” and “family relations” have skewed
distributions and are capable of giving meaningful information about
patterns of adjustment only in rare cases. The evidence on validity presented
in the manual is incomplete, and misleading in places. The manual attempts
to summarize theory and practices of mental hygiene for teachers, and the
necessarily brief presentation runs considerable risk of being misinterpreted
and misapplied.

• Edwards Personal Preference Schedule; Allen L. Edwards; Psychological
Corporation, 1954, 1959. High school, adult. The 225 paired comparisons
lead to scores on 15 “needs.” Designed so as to eliminate “façade” effects, but
patterns can be faked. The scale is recent and research is limited. Gives a
description likely to be helpful in counseling.

• Gordon Personal Profile and Gordon Personal Inventory; Leonard V.
Gordon; World Book, 1953, 1956. High school to adult. Uses eighteen to
twenty forced-choice items in each form. The Profile measures ascendancy,
responsibility, emotional stability, sociability; the Inventory measures
cautiousness, original thinking, personal relations, vigor. Either can be
given in fifteen minutes, yet reliability of scores is about .83. An efficient
instrument for obtaining a self-description profile. Evidence regarding
significance of scores is extremely limited but encouraging.

• Guilford-Zimmerman Temperament Survey; J. P. Guilford and Wayne
Zimmerman; Sheridan Supply Company, 1949. Adolescent and adult.
Measures ten relatively independent traits defined through factor analysis,
including ascendance, sociability, thoughtfulness, objectivity, and restraint.
A more efficient version of the earlier Guilford scales. A typical descriptive
instrument. Little evidence on significance of scores is available.

• Kuder Preference Record, Form A—Personal; G. Frederic Kuder;
Science Research Associates, 1948, 1953. Adolescent and adult. A companion
to the vocational interest inventory, this set of forced-choice items measures
preference among sociable, intellectual, etc., activities. (See Figure 81.)
Occupational patterns have been collected which enhance the usefulness of
the scale in vocational guidance, but little is known about the scale as a
descriptive or diagnostic instrument. It is free from façade effect, but
patterns can be faked. Since scores have no obvious “good-bad” implications,
the Kuder is likely to be suitable for introducing counseling, especially in
high school.

• Minnesota Counseling Inventory; Ralph F. Berdie and Wilbur L. Layton;
Psychological Corporation, 1957. High school. Many of the 413 true-false
self-description statements are rewritten MMPI items. Seven scales measure
adjustment to family and social relations, emotional stability, mood,
conformity, etc. Two control keys are provided. The scales have positive but
very modest validity for separating (for example) pupils known to have
poor family adjustment from those rated as having good adjustment. Retests
after three months show reliabilities in the .70-.80 range. Interpretation of
this instrument will remain uncertain until considerably greater experience
and validating evidence have been accumulated.

• Minnesota Multiphasic Personality Inventory; S. R. Hathaway and J. C.
McKinley; Psychological Corporation, 1943, 1951. Late adolescents and
adults. (See pp. 469 ff.)

• Mooney Problem Check Lists; Ross L. Mooney and Leonard V. Gordon;
Psychological Corporation, 1948, 1950. Forms for junior high through
college, and adult. Subject checks his problems in eleven fields: morals and
religion, finances and living conditions, adjustment to school work, social
relations, etc. High scores identify those who should receive counseling, and
items checked provide a basis for individual or class discussion.

• SRA Youth Inventory; H. H. Remmers and Benjamin Shimberg; Science
Research Associates, 1949. High school; also, SRA Junior Inventory for
Grades 4-8, 1955. A checklist of unusually efficient format covering typical
adolescent problems of educational and vocational planning, and social and
emotional adjustment. Chiefly useful as a starting point for individual and
group guidance; may also be used as a screening inventory to detect
individuals requiring intensive study.

• The 16 P. F. Test; R. B. Cattell, D. R. Saunders, and Glen Stice;
Institute for Personality and Ability Testing, 1950. Age 16 and over. Sixteen
scores measure dimensions such as dominance, general intelligence,
emotional stability, radicalism, and will control. The dimensions are
relatively independent and have some advantages for research purposes. The
short scales have extremely low reliability (.45-.55) and the information on
norms is unsatisfactory. Not recommended for assessment of individuals.
(Versions of the test for various school ages are either available or in
preparation.)

• Study of Values; Gordon W. Allport, Philip E. Vernon, and Gardner Lindzey;
Houghton Mifflin, 1931, 1951. Later adolescence and college. Forced choice
between preferred activities and beliefs. Scored according to Spranger’s
system to indicate relative emphasis on Theoretical, Economic, Political,
Aesthetic, Social, and Religious values. Of some value as a supplement to
interest inventories in vocational guidance; much used for research in social
psychology.

• Survey of Study Habits and Attitudes; Wm. F. Brown and Wayne H.
Holtzman; Psychological Corporation, 1953. College students. Covers study
behavior and attitudes (e.g., “Whether I like a course or not, I still work hard
to make a good grade”). Out of 75 items on this questionnaire about half are
keyed; these distinguish students with good marks from those who do poorly.
This score correlates about .45 with grades and, combined with an ability
test, yields a predictive validity of about .60. The test is fakable and is not
recommended as an admission test. Both the total score and the item
responses are useful in counseling and in how-to-study courses.
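
The entry for the Survey of Study Habits and Attitudes reports a validity of
about .45 alone and about .60 when combined with an ability test. The sketch
below shows how such a combined figure can arise, using the standard formula
for the multiple correlation of a criterion with two predictors. Only the
.45 comes from the text; the ability test's validity (.50) and the
correlation between the two predictors (.20) are assumed values chosen for
illustration, and the function name is hypothetical.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1**2 + r_y2**2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12**2)
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("the three correlations are mutually inconsistent")
    return math.sqrt(r_squared)


# .45 is the figure quoted in the text; .50 and .20 are assumptions.
combined = multiple_r(r_y1=0.45, r_y2=0.50, r_12=0.20)
print(f"combined validity = {combined:.2f}")
```

Under these assumed values the combined validity comes out close to the
reported .60. The gain over either predictor alone depends on the two
predictors overlapping only slightly; with a high correlation between them
the combination would add little.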


Three levels of test were mentioned in Chapter 1: Level A, appropriate
for use by teachers and others without special training in testing; Level B,
for use by counselors and others with a good general understanding of
testing; and Level C, for use by persons with considerable psychological
knowledge and relevant supervised experience. Most personality tests belong
to the higher levels, because their interpretation requires considerable
judgment. There is some risk that the interpretation will create difficulties
which only a professional counselor or clinical psychologist is likely to
recognize. For example, to tell a subject that he is low in emotional
stability may aggravate his difficulties. Giving the same facts to his
employer—no matter how cautiously presented—may blight his chances of
promotion and ultimately increase his maladjustment. Allowing a test to
damage the person’s opportunities and self-satisfaction would be dubious even
if the test were highly valid. Since it is not, the results of personality
tests should generally be reported only to professional workers who know
their limitations. In the light of these dangers, the author suggests the
following categorization of the tests listed above.

A. Can safely be interpreted by teachers 

• Mooney, SRA, and Billett-Starr inventories. Teachers should not attempt
to analyze individual test scores, and unless a counselor is to interpret
individual records, pupils’ answer sheets should probably be unsigned. A
tabulation of the frequency of particular problems is an excellent basis for
group guidance, curriculum planning, and modification of school conditions
which create problems.

B. Can safely be used by the counselor with basic training in vocational
and educational counseling

• Allport-Vernon-Lindzey, Kuder Personal. These inventories reflect the
person’s preferred choices and in that respect resemble interest inventories.
Interpreting the scores is unlikely to threaten self-esteem.

• Billett-Starr, Mooney, SRA inventories. These may be used to identify
pupils for interviewing, and as a starting point for interviewing. Little
attention should be paid to the scores themselves. The counselor should not
attempt to resolve deep emotional conflicts that call for experience he does
not have.

B'. Can safely be used by counselors with considerable training in
personality theory and handling of emotional conflict

• Bell, California (CTP), Edwards, and Gordon inventories. These
should ordinarily be interpreted as a part of an individual case study; except
in research studies it is rarely advisable to apply them routinely to groups.
Since some scores in the Bell, CTP, and Gordon may threaten the individual,
the counselor should consider carefully before deciding whether to
interpret the test to the subject.

C. Require comprehensive training in counseling psychology or clinical
psychology, including understanding of test theory, personality theory, and
handling of emotional conflict

• CPI, Guilford-Zimmerman, Minnesota Counseling Inventory, MMPI.
Further reports of research on the meanings of profiles on some of these
inventories may ultimately make them trustworthy in the hands of counselors
at Level B or B'. Most promising in this respect is the relatively simple
MCI. This instrument is not difficult to interpret, but its use in schools
carries some risk. The Leadership score, for example, is not very valid; it
may be that if such a score is made available to teachers they will give
leadership opportunities to pupils with high scores and deprive pupils with
low scores of this valuable learning experience.

24. A Leadership score identifies pupils whose responses resemble those of other 
pupils who have become leaders. What characteristics other than leadership 
ability and interest are likely to distinguish student leaders in high school from 
the students who take little part in student affairs? 

25. If school officials make use of leadership scores in encouraging certain pupils 
to take leadership responsibilities, will this tend to increase or decrease the 
correlation between the original scores and leadership record by the end of 
high school? 

26. Scores on certain instruments purport to identify students likely to be
troublemakers and potential delinquents. Assuming that such a score has very
high stability and validity, what use might be made of such a test by high
schools? If, as is the case, the validity coefficients are quite low, what
undesirable effects may follow if such scores are collected by principals?

27. The restrictions on use of personality inventories in counseling suggested above 
are admittedly conservative. Some psychologists argue that it is unwise for 
counselors to "imitate the secrecy of the medical profession" in withholding 
scores from teachers and other laymen. These psychologists argue that laymen 
continually make judgments about personality, and that if discouraged from
using test scores they will base their judgments on casual observations of even 
less validity than the tests. What do you think? 

IDIOGRAPHIC ANALYSIS OF THE SINGLE PERSONALITY 
Criticisms of the Concept of “Trait” 

If a test is to assign a rank or score to the individual, there must be a
characteristic or dimension along which this score is located. The desire for
scales analogous to those for size, temperature, and reaction time led
psychologists to postulate that personality has dimensions or traits. A trait
is a tendency to react in a defined way in response to a defined class of
stimuli. Traits are deeply embedded in Western languages; nearly all the
adjectives which apply to people are descriptive of traits: happy,
conventional, stubborn, and so on. Traits are elusive in scientific analysis,
however, and are defined and measured only at the risk of some

The postulate that traits exist is supported by three facts.

• Personalities possess considerable consistency; a person shows the same
habitual reactions over a wide range of similar situations.

• For any habit, we can find among people a variation of degrees or
amounts of this behavior.

• Personalities have some stability, since the person earning a certain
score this year usually has a somewhat similar score next year.

These facts lead one to consider personality traits as habits, capable of
being evoked by a wide range of situations. It would be tedious to catalog a
series of traits such as “habit of bowing politely when meeting a pretty
woman of one’s own age on the street on Sunday,” “habit of bowing politely
when meeting a not-pretty woman . . . ,” etc. Therefore traits are sought
which describe consistent behavior in a wide range of situations. The trait
approach to personality hopes to describe economically the significant
variations of behavior, neglecting unduly specific habits. Since the English
dictionary offers no less than 17,953 adjectives describing traits, the
problem of economy is a serious one.

A trait is a composite of many specific behaviors. To say that a boy is
perfectly honest predicts his behavior in any situation involving honesty.
Ordinarily, however, one possesses an intermediate degree of a trait; he is
honest in some situations but not others. Two people with the same score need
not be alike in personality. Saying that a boy is “50 percent honest” implies
an even probability of honest or dishonest behavior. The description conceals
the fact that he is perfectly honest with money and perfectly dishonest when
grades are the reward. A trait scale permits faultless inference only when
all the behaviors collected under the trait definition are present in the same
person, i.e., when all scores are 100 percent or zero.
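
The point that a pooled trait score can conceal opposite patterns may be
easier to see with numbers. The sketch below is only an illustration: the two
boys, the situation labels, and the tallies of honest (1) and dishonest (0)
acts are invented for the example and come from no study.

```python
# Hypothetical tallies: 1 = honest act, 0 = dishonest act, grouped by situation.
boy_a = {"money": [1, 1, 1, 1], "grades": [0, 0, 0, 0]}
boy_b = {"money": [1, 0, 1, 0], "grades": [0, 1, 0, 1]}


def trait_score(record):
    """Percent of honest acts, pooling every situation into one trait score."""
    acts = [act for situation in record.values() for act in situation]
    return 100 * sum(acts) / len(acts)


def situation_profile(record):
    """Percent of honest acts computed separately for each kind of situation."""
    return {name: 100 * sum(acts) / len(acts) for name, acts in record.items()}


for label, boy in (("A", boy_a), ("B", boy_b)):
    print(label, trait_score(boy), situation_profile(boy))
```

Both boys earn the same pooled score, yet the first is wholly consistent
within each kind of situation and the second is unpredictable in both. Only
the per-situation profile shows the difference.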

“The normal personality” presents troublesome problems of measurement.
The deviate at either end of a trait distribution is well characterized by his
score; he exhibits the trait in unusual degree and in a large number of
situations. Intermediate scores tell the investigator little. Yet every normal
personality has its unique characteristics. Even a person who is “normal” in
all the traits we measure has individuality. Reducing his performance to a
standard list of traits on which he proves not to be exceptional, we lose
patterns which make him different from his also-normal neighbors.

Allport (1937, pp. 248-257) criticized the entire trait approach on this
ground. One man may act from need; he will take money to feed his family,
but will not cheat or lie. Another may be prudent rather than honest; he will
be as honest as he must to avoid being caught. Another may define honesty
in a limited way; he would never steal, but he thinks it right to operate a
business on the principle of “buyer beware.” These men are all honest to an
intermediate degree; statistically, their honesty is near the average.

Since mapping a personality in terms of a few common traits does not
represent the way the individual’s behavior is organized, many investigators
have tried to develop what Allport calls an “idiographic” description. An
idiographic analysis would define new traits as needed to fit each individual
(e.g., “shy with women of his own age in non-business relationships”). The
difficulties that face such efforts are enormous, but some initial steps have
been taken successfully.

The trait approach describes responses as if they were general over a very
large class of situations. “Dominant,” “paranoid-like,” and “honest” describe
responses independent of particular situations. The idiographic approach
looks for equivalences among situations. Sometimes student X shows
dominance, sometimes not. If we can find out what situations bring out
dominant reactions (i.e., are equivalent for him), we can then hope to
predict his behavior with some exactness.

As a first step in studying situational equivalences, C. E. Osgood and
G. A. Kelly have developed techniques for studying perception of the
significant persons in a subject’s life. These others are an important part of
the person’s world, and many reactions are determined by his perception of
them.

28. Show how “stubbornness” might be present in some situations and absent in
others for the same person, even though both actions are typical for him. 

The Semantic Differential 

Osgood’s method was developed for research on perception, meaning, and
attitudes, rather than as a personality test (Osgood et al., 1957). Known as
the Semantic Differential, it measures indirectly the connotations of words
or objects. The stimulus is rated on a seven-point scale, various scales and
stimuli being mixed in random order. Successive items might appear as
follows.


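My Father     soft ___:___:___:___:___:___:___ hard

Fraud         rich ___:___:___:___:___:___:___ poor

Confusion     fair ___:___:___:___:___:___:___ unfair

My Father     deep ___:___:___:___:___:___:___ shallow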


In most studies Osgood and his students have been interested in specific
stimuli (e.g., “physicians,” “Presidential candidate A”) as perceived by a
large group. For examining an individual, Osgood employs stimuli of personal
significance, e.g., “my father.”

The subject is to check the scale rapidly, recording his first impressions.
Naturally it is difficult to defend any single response as “right” when judging
communism on the scale thin-thick, but subjects have little difficulty in
checking associations. The scoring can be accomplished in two ways. Using
factor analysis, the scales can be grouped into good-bad, strong-weak, and
active-passive keys. Average scores can be assigned for each stimulus. Thus
we could say that a subject has indirectly described his father at +1 on
good (on a scale from +3 to -3), 2.4 on strong, -0.4 on active. The other
scoring method compares stimuli two at a time, converting the differences
between their ratings into a “distance score” measuring the degree to which
the subject perceives the stimuli as similar.
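
Both scoring methods can be sketched in a few lines. Everything specific in
this sketch is an assumption made for illustration: the six scales, their
assignment to the three keys, and the ratings themselves are invented, and the
distance is computed as the square root of the summed squared differences
between ratings, which is one common way of forming such a score.

```python
import math

# Hypothetical keys: which scales are averaged into each factor score.
# The first pole of each scale is treated as the positive (+3) end.
FACTOR_KEYS = {
    "good": ["fair-unfair", "clean-dirty"],
    "strong": ["hard-soft", "deep-shallow"],
    "active": ["fast-slow", "sharp-dull"],
}

# Hypothetical ratings by one subject, each from +3 to -3.
ratings = {
    "my father": {"fair-unfair": 1, "clean-dirty": 1, "hard-soft": 3,
                  "deep-shallow": 2, "fast-slow": 0, "sharp-dull": -1},
    "me": {"fair-unfair": 0, "clean-dirty": 1, "hard-soft": -2,
           "deep-shallow": 0, "fast-slow": -1, "sharp-dull": -1},
}


def factor_scores(concept_ratings):
    """Average the ratings on the scales belonging to each key."""
    return {
        factor: sum(concept_ratings[s] for s in scales) / len(scales)
        for factor, scales in FACTOR_KEYS.items()
    }


def distance(a, b):
    """Distance score: root of summed squared rating differences over all scales."""
    return math.sqrt(sum((a[scale] - b[scale]) ** 2 for scale in a))


print(factor_scores(ratings["my father"]))
print(distance(ratings["my father"], ratings["me"]))
```

A small distance means the subject rates the two stimuli alike on nearly every
scale; a large one means he perceives them as dissimilar.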

The best illustration of the technique is its application to a case of triple
personality. A dissociated personality is one in which the person possesses
two or more different “selves” and shifts back and forth between them (a bit
like Dr. Jekyll and Mr. Hyde). Eve White had three such identities who
“took possession” at various times, and her therapists were able to administer
the Semantic Differential to each self in turn (Thigpen and Cleckley, 1953,
1957). In Figure 84 we present the configurations from two of the tests. The
black ball represents the midpoint on all scales. “Good” is at the top,
“active” at the left, and “weak” toward the viewer. The solid line connecting
the black ball with “doctor” (who is always good, strong, very active) helps to
orient the figure.

[Figure: two three-dimensional configurations, one for Eve White and one for
Eve Black, locating concepts such as Love, Confusion, Hatred, and Spouse.]

FIG. 84. Meaning systems of Eve White and Eve Black on the Semantic
Differential (Osgood and Luria, 1954).

Two psychologists interpreted the patterns “blindly,” i.e., with no further
knowledge of the cases. Looking first at a few salient indicators, they pointed
out in Eve White’s record the separation of love and sex, the meaninglessness
of the spouse, the weakness of “me.” Eve Black seems to place hatred
and fraud in a favorable cluster with “me” and rejects spouse, love, job, and
child. (The pattern of the third self, Jane, which will not be discussed here,
is normal; love and sex are closely linked and favorable.) An impressionistic
“guess” by the interpreter led to this summary of the first two personalities
(Osgood and Luria, 1954):

Eve White is the woman who is simultaneously most in contact with social
reality and under the greatest emotional stress. She is aware of both the
demands of society and her own inadequacies in meeting them. She sees herself
as a passive weakling and is also consciously aware of the discord in her
sexual life, drawing increasingly sharp distinctions between love as an
idealized notion and sex as a crude reality. She maintains the greatest
diversity among the meanings of various concepts. She is concerned and
ambivalent about her child, but apparently is not aware of her own ambivalent
attitudes toward her mother. . . . Those psychoanalytically inclined may wish
to identify Eve White with dominance of the superego; certainly, the superego
seems to view the world through the eyes of Eve White, accepting the mores or
values of others (particularly her mother) but continuously criticizing and
punishing herself. . . .

Eve Black is clearly the most out of contact with social reality and
simultaneously the most self-assured. To rhapsodize, Eve Black finds Peace of
mind through close identification with a God-like therapist (My Doctor,
probably a father symbol for her), accepting her Hatred and Fraud as perfectly
legitimate aspects of the God-like role. Naturally, she sees herself as a
dominant, active wonder-woman and is in no way self-critical. She is probably
unaware of her family situation. Like a completely selfish infant, this
personality is entirely oriented around the assumption of its own perfection.

The pattern corresponds well with the therapists’ picture of Eve. The
therapists described the same personalities in these phrases, among others:

Eve White: demure, almost saintly, seldom lively, tries not to blame her
husband for marital troubles, every act demonstrates sacrifice for her little
girl, meek, fragile, doomed to be overcome.

Eve Black: a party girl, shrewd, egocentric, rowdy wit, all attitudes
whimlike, ready for any little irresponsible adventure, provocative, strangely
secure from inner aspect of grief and tragedy.

The correspondence of the portraits is remarkable. A single brilliant bit,
however, is not to be regarded as adequate evidence of validity.

Mapping stimulus equivalences gives a different type of information from
that of the trait-oriented questionnaire, but the semantic map does give
information about traits. Eve White is unquestionably dissatisfied with
herself, perfectionist, unwilling to express emotion; any questionnaire would
show a high score on introversion and hysteric tendencies. The Semantic
Differential adds information about specific sources of conflict, lack of
acceptance of spouse and sex, and her child’s weakness and need for
protection. (Indeed, Mrs. White’s feeling that she could not give her child
adequate protection was a precipitating cause of her illness.) Eve Black is
shallow, uncontrolled, self-centered: extrovert on any questionnaire and on
MMPI surely extreme on Ma and Pd. This is apparent in the Semantic
Differential association of me, hatred, and fraud as good and strong. But the
map gives the additional picture of strong identification with men and
rejection of child and spouse. One can judge what persons and treatments are
likely to win her respect and cooperation, and what rewards she is likely
to work for. Such information goes far beyond what one can get from even
the most valid description of her general personality style. The rewards that
a therapist might ordinarily offer (opportunity to hold a job, restoration
of marriage) were spurned by Eve Black. The cooperation that eventually
permitted some success in therapy was won only when the therapist appealed
to Eve Black’s fear of sickness.

The Role Concept Repertory (Rep) Test of G. A. Kelly (1955) is much
like Osgood’s procedure, save that the subject himself now picks the scales
on which he will respond. This device reflects Kelly’s theory of social
behavior and psychotherapy, which places great stress on the way the person’s
conceptualizations shape his behavior. The principal aim of the Rep test is
to obtain information useful to a therapist.

The subject is given a list of about twenty roles, of which the following
are representative.

Your wife or present girl friend
Your mother
A person with whom you have worked who was easy to get along with
A girl you did not like when you were in high school
The person whom you would most like to be of help to (or whom you feel most
sorry for)

The subject names the people who fill these roles for him. The examiner then
selects three of the persons and asks, “In what important way are two of
them alike but different from the third?” If the response is superficial
(“These two are tall”) the examiner asks for some further similarity. A useful
response might be, “These two are self-confident and this one is shy.” The
subject has then stated a bipolar scale along which he perceives people to
differ. The procedure is continued until many scales have been elicited
and applied to the significant others.

29. Is the Semantic Differential fakable? Can one argue that it assesses unconscious 
attitudes? 

30. How might the Semantic Differential be used to study transference relations 
during psychotherapy? 

31. Osgood finds only three predominant factors among his scales. Do three
dimensions appear adequate to describe one's perception of others?

32. Is Osgood's test primarily behavioristic or phenomenological in outlook? 

Suggested Readings

Diamond, Solomon. The factorial approach. Personality and temperament. New
York: Harper, 1957. Pp. 151-183.

This review attempts to classify dimensions of personality by factor analysis,
considers lists of possibly important traits, and also gives an introductory
explanation of factor-analytic procedures.

Hathaway, Starke R., & Monachesi, Elio D. Personality characteristics of
adolescents as related to their later careers: II. Two-year follow-up on
delinquency. Analyzing and predicting juvenile delinquency on the MMPI.
Minneapolis: University of Minnesota Press, 1953. Pp. 109-135.

This massive study of score patterns indicative of delinquency shows both
the advantages and the disadvantages of analyzing combinations of scores.
The first and last chapters of the book deal with the practical meaning of the
research.

Meehl, Paul E., & Hathaway, Starke R. The K factor as a suppressor variable
in the MMPI. J. appl. Psychol., 1946, 30, 525-564. (Reprinted in G. S. Welsh
and W. G. Dahlstrom (eds.), Basic readings on the MMPI in psychology and
medicine. Minneapolis: University of Minnesota Press, 1956. Pp. 12-40.)

Several theoretical aspects of the development of MMPI are discussed,
including the authors’ reasons for not forming homogeneous clusters of items
and the need for corrections for bias in self-reports.

Schiele, B. C., & Brozek, Jozef. “Experimental neurosis” resulting from
semistarvation in man. Psychosom. Med., 1948, 10, 31-50. (Reprinted in Welsh
and Dahlstrom, op. cit. Pp. 461-483.)

In a study of MMPI changes during an experimental stress of six months’
duration, nine cases are described in detail, showing the relationship
between MMPI profiles and behavior patterns.

Ullmann, Charles A. Teachers, peers, and tests as predictors of adjustment.
J. educ. Psychol., 1957, 48, 257-267.

Evidence is given on the difference between information contained in teacher
ratings and in self-report questionnaires. Attention focuses on pupils who drop
out of school or who perform poorly in school. Items capable of identifying
such students are listed.



17 


Judgments and Systematic Observations


WHETHER an individual’s reputation corresponds to his behavior or not, it
is unquestionably significant. A person who has impressed his former teachers
as imaginative is favored by a college admissions committee. Business
and military organizations file supervisors’ opinions and use them in deciding
whom to promote. Teachers find out what children think of each other
in order to understand relationships in the classroom and to identify social
misfits. Furthermore, as we have seen, ratings are an important criterion
for studying job performance and adjustment. In this chapter we shall
consider problems and techniques of obtaining ratings by superiors and by
peers (companions at the same level in the organization). We shall then
turn to systematic observations of behavior.


RATINGS AND SOCIOMETRIC REPORTS 
Ratings by Supervisors 

Descriptions by supervisors (foremen, teachers, superior officers, etc.)
are hard to compare because styles of writing vary. Rating scales are
therefore used to reduce impressions to manageable form. A rating scale
consists of a list of traits to be rated. The form of the scale may vary from
a simple list of adjectives to be checked to a continuous scale with several
descriptive labels, as illustrated in Figure 85. Before evaluating specific
forms of rating scale, let us consider the chief difficulties to be overcome.

Name of student ____________

(Beside each item the form provides a space headed “Please record here
instances on which you base your judgment.”)

A. How are you and others affected by his appearance and manner?
   [ ] Sought by others
   [ ] Well liked by others
   [ ] Liked by others
   [ ] Tolerated by others
   [ ] Avoided by others
   [ ] No opportunity to observe

B. Does he need frequent prodding or does he go ahead without being told?
   [ ] Seeks and sets for himself additional tasks
   [ ] Completes suggested supplementary work
   [ ] Does ordinary assignments of his own accord
   [ ] Needs occasional prodding
   [ ] Needs much prodding in doing ordinary assignments
   [ ] No opportunity to observe

C. Does he get others to do what he wishes?
   [ ] Displays marked ability to lead his fellows, makes things go
   [ ] Sometimes leads in important affairs
   [ ] Sometimes leads in minor affairs
   [ ] Lets others take lead
   [ ] Probably unable to lead his fellows
   [ ] No opportunity to observe

D. How does he control his emotions?
   [ ] Unusual balance of responsiveness and control
   [ ] Well balanced
   [ ] Usually well balanced
   [ ] Tends to be unresponsive    [ ] Tends to be over-emotional
   [ ] Unresponsive, apathetic     [ ] Too easily depressed, irritated, or elated
   [ ] No opportunity to observe

E. Has he a program with definite purposes in terms of which he distributes
   his time and energy?
   [ ] Engrossed in realizing well-formulated objectives
   [ ] Directs energies effectively with fairly definite program
   [ ] Has vaguely formed objectives
   [ ] Aims just to “get by”
   [ ] Aimless trifler
   [ ] No opportunity to observe

FIG. 85. The ACE Personality Report, Form B. (Reproduced by permission of
American Council on Education.)

Sources of Error. The first problem is generosity error, i.e., the tendency of
raters to give favorable reports. The teacher, asked to indicate on a report
card whether the pupil is cooperative, will usually rate all except the most
troublesome pupils at the highest point on the scale. Company commanders
rate 98 percent of their junior officers in the top two categories (out of
five) on efficiency reports. Such ratings have little value because they do
not discriminate between individuals. There are several reasons for generosity
errors: the rater may feel that he is admitting poor leadership if he says that
his subordinates are not performing well, he tends to feel kindly toward his
associates, he thinks he may have to justify any implied criticism, and he
often finds it easier to say good about everyone than to pause to make
careful discriminations.

Ambiguity is a second difficulty. Just as a self-report question on
leadership can be variously interpreted, so a rater may define leadership in
many ways. To one judge “leadership” suggests conscious wielding of authority,
crisp decisions, and general dominance. A person rated high by this judge
would receive a lower rating from a judge who looks for a leader to encourage
subordinates, bring out cooperative decisions, and subordinate his own
views to the decision of the group.

The rater is usually instructed to mark one of several alternative scale
positions, and these response positions may also be ambiguous. In some of the
early rating scales the respondent was asked to rate “Friendliness,” for
example, on a scale from 0 to 100. No particular definition can be given for a
number such as 85 on that scale, and the same score may indicate quite
different behavior to different raters. Such words as average and excellent
are equally indefinite. They should be replaced by specific descriptions of
behavior.

Judges have constant errors or biases. A constant error can be identified
when two judges rate the same individuals. If the judges’ averages differ,
they are observing different aspects of behavior or are defining the scale
differently. Generosity is one such constant error. The response styles
mentioned in connection with achievement tests and personality questionnaires
are also observed in ratings; e.g., one judge rarely uses the extremes of the
scale in describing subjects, whereas another describes most persons in
black-and-white terms.
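
A simple check for a constant error, and for a difference in how widely two
judges spread their ratings, might run as follows. The ratings are invented
for illustration: two hypothetical judges rate the same six persons on a
7-point scale, and the spread is summarized by a standard deviation, a choice
made here for convenience rather than taken from the text.

```python
from statistics import mean, pstdev

# Hypothetical ratings of the same six persons by two judges (7-point scale).
judge_a = [6, 5, 6, 7, 6, 6]
judge_b = [4, 3, 5, 5, 3, 4]


def constant_difference(first, second):
    """Difference between the two judges' average ratings of the same persons."""
    return mean(first) - mean(second)


def spread(judge_ratings):
    """Standard deviation of a judge's ratings: how much of the scale he uses."""
    return pstdev(judge_ratings)


print("difference in averages:", constant_difference(judge_a, judge_b))
print("spread of judge A:", round(spread(judge_a), 2))
print("spread of judge B:", round(spread(judge_b), 2))
```

In this made-up case the first judge averages two scale points higher than the
second, which suggests generosity or a different definition of the scale, and
he also keeps his ratings within a narrower range.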

A further source of differences between judges is that each has limited
information about the individual. Since a physical education teacher and an
English teacher see entirely different sides of the student, their ratings on
initiative, imagination, or reaction to frustration will disagree. Even when an
observer sees an individual in a great variety of situations, his sample of
behavior is still limited. The supervisor can base his ratings only on what
the man does under his supervision, and this may not be at all representative
of his work elsewhere.

The so-called halo effect is an error which obscures the pattern of traits
within the individual. The observer forms a general opinion about the
person’s merit, and his ratings on specific traits are strongly influenced by
this overall impression. Even productivity may be rated erroneously because of
the influence of a pleasing or displeasing personality. Halo is responsible for
the substantial correlations shown in Table 66 among ratings given to 1100
industrial employees. The ratings on quite dissimilar traits show a marked
general factor, apparently corresponding to the foreman’s opinion of the
man’s industriousness and productivity.

TABLE 66. Intercorrelations of Ratings Given 1100 Industrial Workers

[Table of intercorrelations among trait ratings, including Accuracy,
Productivity, Initiative, Judgment, Personality, and Health; entries not
reproduced.]

a Boldface figures show correlations of two raters' judgments on each worker,
i.e., reliability of judging.

Source: Ewart et al., 1941.

These sources of error have four undesirable consequences.

• Ratings may not reveal important individual differences because they
pile up at the favorable end of the scale.

• Ratings may be seriously invalid, representing chance effects or traits
other than the one supposedly rated. Psychologists rating intelligence on
the basis of observation, for example, overrated the ability of men with more
introspective, less outgoing personalities (Barron, 1954).

• Halo effect obscures the descriptive picture.

• Ratings by different judges disagree. Evidence of unreliability is seen
in Table 66. Reliability of rating is greatest for behaviors which can be
clearly specified and for traits which are descriptive rather than
interpretative. Traits reliably rated include talkative, assertive, bashful,
and cultured. Reliability is lowest for general, vaguely stated attributes
such as adaptable, sensitive, and kindly (Hollingworth, 1922; Mays, 1954).

Improvement of Ratings. The problems in improving ratings are similar to
the problems in improving self-reports. Again, the tester must assume that
the respondent will give false information if he thereby gains psychological
rewards. To be sure, the information affects the subject’s future rather than
the rater’s, but this does not mean that the rater is uninvolved. We have
mentioned the rater’s inclination to interpret reports on the subject as a
reflection of the adequacy of his own teaching or supervision. Bias is even
more certain when the therapist’s ratings are used as a criterion of
personality change during psychotherapy. Sometimes a rater gives a low rating
because he wishes to retain an employee who might be promoted if he got a
high rating. A teacher who rates a scholarship applicant may enlarge upon
his merits nearly to the point of perjury in order to help the student.

Selection of raters is the first point at which to improve ratings. Raters
cannot give valid information unless they know the subject well. Other
things being equal, those in immediate contact with the subject can give
better information than those who rely on hearsay. A high-school teacher
usually can give more dependable information on a pupil’s work habits and
social behavior than can the principal.

One elementary precaution, often overlooked in practice, is to include in
the rating blank a question regarding the extent of the rater’s acquaintance
with the subject and the kinds of situation in which observations were made,
and a space where the rater can indicate “insufficient opportunity to
observe” each trait instead of making an estimate from inadequate information.
The American Council scale (Figure 85) not only provides a space to indicate
lack of information but requests specific evidence for each rating so that
the reader can judge for himself whether a favorable rating is justified by
the rater's knowledge. If a judge is directed to mark every trait, some ratings
are little better than guesses. Conrad (1932) directed raters to star traits
which they regarded as especially important in the child’s personality.
Interjudge correlations on all traits ranged from .67 to .82. But for the
traits which three judges agreed in starring, the ratings correlated as high
as .96.

When the same judge is used repeatedly, it may be possible to keep a
record of his ratings and ultimately to estimate his constant error. For
example, a college learns to allow for the fact that one high school has a
“tough” grading or rating policy, whereas another school is lenient. It is
rarely practical to make exact statistical corrections for such differences
between raters.

One can raise the reliability of ratings by combining impressions of several
judges. If, as in Table 66, the reliability of a rating is about .45, the
average of two independent judges is expected to have a reliability of .60 and
the average of five judges a reliability of .80. (These results are given by
the Spearman-Brown formula, p. 131.) In the average the bias of one judge
tends to cancel the bias of another, and each adds information the other had
no opportunity to observe. Reliability may be lowered rather than raised,
however, when the additional judges are only remotely acquainted with the
subject.

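
The arithmetic can be checked with a short sketch of the Spearman-Brown
formula in its usual form, r_k = k r / (1 + (k - 1) r), where r is the
reliability of a single judge and k the number of judges averaged. The
function name is chosen for this example only, and the formula assumes that
the judges are independent and equally reliable.

```python
def spearman_brown(r_single, k):
    """Expected reliability of the average of k comparable, independent judges."""
    return k * r_single / (1 + (k - 1) * r_single)


for judges in (1, 2, 5):
    print(judges, "judge(s):", round(spearman_brown(0.45, judges), 2))
```

With a single-judge reliability of .45 the formula gives roughly .62 for two
judges and .80 for five, close to the rounded figures quoted above.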
Careful preparation of the rating scale is of great value. There is some
advantage in using several questions dealing with a particular aspect of
personality, just as there is advantage in basing self-report scores on many
related items. On the other hand, anything which enlarges the rater’s task
invites perfunctory answers.

The rater may be asked to make a simple checkmark beside satisfactory
qualities, respond on a numerical scale, or make choices among carefully
described alternatives. Where ratings on each trait are to be considered
separately the last of these forms, known as the descriptive graphic rating
scale, is generally best (cf. Figure 86). The scale is descriptive, since each
point corresponds to a recognizable behavior pattern. It is graphic, in that
the rater is allowed to mark at intermediate points if he does not find any one
of the descriptions entirely suitable. In general, 5- to 7-point scales seem to
serve adequately. With informed and serious professional judges, much finer
subdivisions of the scale prove profitable (Champney and Marshall, 1939).

Is he abstracted or wide-awake?
   Continually absorbed in himself             (5)
   Frequently becomes abstracted               (4)
   Usually present-minded                      (2)
   Wide-awake                                  (1)
   Keenly alive and alert                      (3)

Is he shy or bold in social relationships?
   Painfully self-conscious                    (4)
   Timid, frequently embarrassed               (2)
   Self-conscious on occasions                 (1)
   Confident in himself                        (3)
   Bold, ... social feelings                   (5)

How does he accept authority?
   Defiant                                     (5)
   Critical of authority                       (4)
   Ordinarily obedient                         (3)
   Respectful, complies by habit               (1)
   Entirely resigned, accepts all authority    (2)

FIG. 86. Items for the Haggerty-Olson-Wickman Behavior Rating Schedule.
(Copyright 1930 by World Book Company and reproduced by permission.)

The 5-point scale obtains more discrimination than the “yes-no” checklist.
A judge will ordinarily say “yes” when asked “Does the subject have good
judgment?” but if given several alternate choices he may check “Sometimes
overlooks relevant facts in making decisions.” The 5-point scale also has the
advantage of drawing attention to various kinds of deviation. A simple
“yes-no” question, “Does he accept authority?” would not distinguish, as does
the Haggerty-Olson-Wickman scale, between the respectful, obedient child and
the slavish, spiritless conformer.

To predict an external criterion, traits for rating can be selected
empirically. The Haggerty-Olson-Wickman scale is intended to screen maladjusted
pupils for psychological study. A direct interpretation might be made simply
by scoring socially desirable behavior. The investigators found, however,
that behavior which on its face seems desirable may actually be a sign of
maladjustment, appearing more often among problem children than among
pupils in general. Weights were therefore assigned to each response, as
indicated by the numbers in Figure 86. A score of 1 was given for descriptions
rarely applied to problem children, and a score of 5 for responses
characteristic of the problem group. We see, for example, that “Wide-awake” is a
favorable description, but that “Keenly alive and alert” describes children who
get into trouble about as often as it does the well adjusted.
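
The empirical weighting can be illustrated with a minimal sketch. The weights are those printed beside the three items of Figure 86, read from left to right along each scale. The item names, the function name, and the rule of simply summing weights across items are assumptions made for illustration; the text does not state how the published schedule combines item weights.

```python
# Weights copied from Figure 86, listed left to right along each printed scale.
# 1 = description rarely applied to problem children, 5 = characteristic of them.
HOW_WEIGHTS = {
    "abstracted_or_wide_awake": [5, 4, 2, 1, 3],
    "shy_or_bold": [4, 2, 1, 3, 5],
    "accepts_authority": [5, 4, 3, 1, 2],
}


def maladjustment_score(marks):
    """Sum the empirical weights for the scale positions a rater marked.

    `marks` maps an item name to the 0-based position marked on its scale.
    Summing across items is an assumption of this sketch.
    """
    return sum(HOW_WEIGHTS[item][pos] for item, pos in marks.items())


# A child marked "Keenly alive and alert", "Confident in himself",
# and "Respectful, complies by habit".
example = {"abstracted_or_wide_awake": 4, "shy_or_bold": 3, "accepts_authority": 3}
print(maladjustment_score(example))   # 3 + 3 + 1 = 7
```

Note that the most flattering description on the first item does not earn the lowest weight, which is the point of selecting weights empirically rather than by face value.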

This weighting technique suggests the possibility of concealing the scoring
plan to outwit the rater who is unwilling to give an unfavorable report. One
might count only those ratings which correlate with the criterion. For
example, in selecting salesmen one might give credit for high ratings such as
energetic, ambitious, and friendly (if these traits correlate with success in the
job) and no credit for equally high ratings such as hard-working,
well-adjusted, and cooperative (if these traits have no predictive value). Indeed,
we may go farther, and assign a negative weight to these irrelevant favorable
ratings to compensate for rater generosity.

Forced-Choice Methods. This idea underlies the forced-choice method of
merit rating pioneered by the military services. Periodic ratings of each
officer by his superior are required for use in promotion and reassignment. The
tradition of giving favorable ratings, however, means that conventional
rating forms bring in almost no information. Psychologists therefore invented a
forced-choice scale. As a first step in making such a scale, superiors are asked
to describe men by checking a list of phrases. A follow-up is then made to
determine which men perform best in subsequent assignments, and for each
adjective or phrase two figures are obtained: a favorability index and a
validity index. A favorable-valid item is one which raters apply frequently and
which predicts success; an unfavorable-valid item is rarely applied and when
applied forecasts failure. Invalid items are those not associated with success
or failure.

The forced-choice item is then developed. One technique used by Army
psychologists employs two pairs of statements. A favorable-valid item was
matched with a favorable-invalid item, and an unfavorable-valid item with
an unfavorable-invalid item. These four were presented together, the rater
being instructed to indicate the one statement which best describes the man
and the one which least describes him. Thus the rater is forced to make at
least one unfavorable statement, and to choose only one favorable statement.
An item might consist of the following alternatives:

Wins confidence of his men (Favorable-valid)

Inclined to gripe about conditions (Unfavorable-invalid)

Punctual in completing reports (Favorable-invalid)

Has weak tactical judgment (Unfavorable-valid)

The response is scored by assigning a plus credit for each favorable-valid
choice and a minus credit for each unfavorable-valid choice.
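
The scoring rule just stated can be sketched as follows. The four statements are those of the example item; the class and function names are hypothetical. The text does not say how the "least describes" pick enters the score, so this sketch credits only the statement marked as best describing the man, which is an assumption rather than the Army procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    text: str
    favorable: bool
    valid: bool


# The four alternatives of the example item in the text.
TETRAD = [
    Statement("Wins confidence of his men", favorable=True, valid=True),
    Statement("Inclined to gripe about conditions", favorable=False, valid=False),
    Statement("Punctual in completing reports", favorable=True, valid=False),
    Statement("Has weak tactical judgment", favorable=False, valid=True),
]


def credit(most_descriptive: Statement) -> int:
    """Return +1, -1, or 0 for the statement marked as best describing the man."""
    if not most_descriptive.valid:
        return 0  # invalid statements carry no weight, however flattering
    return 1 if most_descriptive.favorable else -1


def score_form(choices):
    """Sum credits over all items on a form (one 'best describes' pick each)."""
    return sum(credit(c) for c in choices)


# Three hypothetical items answered with a valid-favorable, an invalid,
# and a valid-unfavorable pick.
print(score_form([TETRAD[0], TETRAD[2], TETRAD[3]]))   # 1 + 0 - 1 = 0
```

A rater who always picks the flattering but invalid statement earns no credit for the man, which is how the form resists generosity.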

The aim in the forced choice is to separate the rater’s task of describing
what the individual does from the task of evaluating what he does
(Richardson, 1949). The responsibility for description must rest on the rater, but
evaluation is left to the decision maker.

The score indicates the man’s probable merit. For a combat command, it
appears likely that winning confidence is more important than punctuality
in reports, and tactical judgment more important than contentment. The
scoring weights, however, are assigned on the basis of statistical evidence,
not on the basis of judgment. The weights are kept secret from the raters, but
since raters can guess to some extent how the scale will be scored, the
choice is only relatively free from distortion.

Highland and Berkshire (1951) compared several types of forced-choice
instrument for rating instructors. Whereas a graphic rating scale correlated
only .40 with rankings, validities for forced-choice scales ranged from .53 to
.62. The most valid form presented four favorable traits, two relevant to the
criterion and two irrelevant, with the rater instructed to mark the two most
descriptive of the instructor. Such a form can be distorted by a desire to give
“good” ratings, but only to a limited degree. When supervisors filled out the
scale a second time with instructions to give as favorable an impression as
possible, the median of the “faked” scores fell near the 67th percentile of the
“honest” distribution. As Figure 87 shows, the bias raised many scores from
the “bad” end of the scale to the average but did not lead to a piling up of
very high scores. It is evidently possible for the rater to avoid giving a bad
impression on this type of scale, but not to fake a very good one. Raters
preferred the form using all favorable traits to the one using two favorable and
two unfavorable traits. The latter was also more subject to distortion, since
“faked” scores did pile up at the high end of that scale.

Raters are generally antagonistic to forced-choice techniques. They want
to know how their reports will be interpreted and want to be free to give an
entirely favorable impression. Whether a forced-choice scale can be used in
a given situation depends upon the cooperation the data gatherer can
anticipate or upon the authority he can bring to bear. The Army, after developing
the technique and establishing its validity, concluded that resistance from
officers was too great to justify continued use of forced-choice scales in
efficiency reports. It has continued to use forced choice, however, in self-report
forms. In industry, forced-choice merit ratings have had considerable
application.

FIG. 87. Distribution of ratings on a forced-choice scale under normal and faking
conditions (Highland and Berkshire, 1951).

A method of restricting raters which encounters less opposition is to
require rankings. Where large groups of men are to be judged, the instructions
may call, not for complete ranking, but for dividing men into groups such as
top 5 percent, next 20 percent, middle 50 percent, next 20 percent, and
bottom 5 percent. This forced distribution obtains more differentiation under
some circumstances than does the graphic scale. Ranking presupposes that
the judge is giving consideration to the proper trait; a ranking on overall
merit will be misleading if the rater stresses obedience and dependability
when the institution wishes to select men with initiative and imagination.
The chief limitation of the ranking method is that groups are rarely
comparable, so that a top man in one group might rank tenth in another.
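
A short sketch shows how the forced distribution translates into group sizes for a given number of men. The percentages are those named above. The function name and the rounding rule (round down, then hand leftover places to the groups with the largest remainders) are assumptions, since the text does not say how fractional quotas are settled.

```python
def forced_distribution_quotas(n_men, percents=(5, 20, 50, 20, 5)):
    """Split n_men into group sizes that follow the given percentages.

    Quotas are rounded down first; the men left over go to the groups
    with the largest fractional remainders, so the sizes always sum to n_men.
    """
    if sum(percents) != 100:
        raise ValueError("percentages must sum to 100")
    exact = [n_men * p / 100 for p in percents]
    quotas = [int(x) for x in exact]
    leftover = n_men - sum(quotas)
    by_remainder = sorted(range(len(percents)),
                          key=lambda i: exact[i] - quotas[i], reverse=True)
    for i in by_remainder[:leftover]:
        quotas[i] += 1
    return quotas


print(forced_distribution_quotas(40))   # [2, 8, 20, 8, 2]
```

With forty men the judge must name exactly two for the top group and two for the bottom group, however reluctant he is to single anyone out.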

The “Q-sort” technique developed by Stephenson (1953) is valuable for
certain purposes. In comprehensive personality assessment, for example,
interviewers and observers may collect a great deal of information and arrive
at a comprehensive picture of the man’s strengths and weaknesses. Much
of this information is lost if it is reduced to a few simple numerical ratings.
In a descriptive report, the psychologist sometimes makes general remarks
which might apply to any subject and omits some important findings.

Stephenson’s method calls for the preparation of a set of phrases covering
the aspects of personality or performance that concern those who will use the
report. There is no single list of statements for Q-sorting, since the traits to
consider in selecting executives may differ a good deal from the descriptions
useful in appraising patients during psychotherapy. The following
statements are representative of a list used by assessors for evaluating superior
men (Block, 1957):

Communicates ideas clearly and effectively
Is rigid, inflexible in thought and action
Takes an ascendant role in his relations with others
Is masculine in his style and manner of behavior
Lacks insight into his own motives and behavior
Overcontrols his impulses, is inhibited, needlessly delays or denies gratification
Allows personal bias, spite, or dogmatism to enter into his judgment of issues

The statements or phrases are written on separate cards. The rater is told
to sort the cards into eleven piles, with those most descriptive of the subject
in the first pile and those least descriptive in the eleventh. The rater must
place a specified number of items in each pile; if there are 100 statements to
be sorted, he might be told to put them into this distribution:

                   Most descriptive                Least descriptive
Pile                1   2   3   4   5   6   7   8   9  10  11
Number of cards     2   4   8  11  16  18  16  11   8   4   2

The number of piles and the number of cards differ in different studies.

The sorting procedure has some advantage over the usual rating form,
since the rater can shift items back and forth. In the usual inventory or
checklist his definition of a category such as “Definitely true” may shift while
he is making his ratings, but in a Q sort we may expect the items placed in
the same pile to be truly comparable. The fixed distribution eliminates rater
differences in response style. It cannot, however, eliminate rater bias. The
rater can easily arrange the items so as to describe the subject favorably
(Edwards, 1957). The Q-sort method may also be used in obtaining
self-descriptions.
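
Because the distribution is fixed, a completed sort can be checked mechanically before it is scored. The sketch below uses the eleven-pile distribution for 100 statements given above; the function name, the statement numbering, and the example sort are hypothetical.

```python
from collections import Counter

# Required number of cards per pile (pile 1 = most descriptive), from the text.
REQUIRED = [2, 4, 8, 11, 16, 18, 16, 11, 8, 4, 2]


def check_q_sort(placements, required=REQUIRED):
    """Return a list of (pile, found, expected) for piles with the wrong count.

    `placements` maps each statement id to the pile (1-based) it was sorted into.
    An empty list means the sort follows the forced distribution.
    """
    counts = Counter(placements.values())
    return [(pile, counts.get(pile, 0), expected)
            for pile, expected in enumerate(required, start=1)
            if counts.get(pile, 0) != expected]


# Build a hypothetical valid sort: statements 0..99 dealt into piles in order.
valid_sort = {}
card = 0
for pile, n in enumerate(REQUIRED, start=1):
    for _ in range(n):
        valid_sort[card] = pile
        card += 1

print(sum(REQUIRED))             # 100
print(check_q_sort(valid_sort))  # []
```

A sort that fails the check has to be returned to the rater, since the comparability of piles across raters rests on everyone using the same distribution.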

Q-sort data can be handled in several ways. One may compute the median
position of statements representing a single dimension of personality, just as
a personality test is scored for anxiety or dominance. One may develop an
actuarial key for items predictive of a criterion, as in forced-choice rating
scales. One may compute a correlation showing how similar one subject is to
another. Some of the elaborate techniques used with Q sorts are open to
serious criticism (Cronbach and Gleser, 1954). Properly designed Q statements,
however, have unquestionable value for obtaining complex descriptions
which can be systematically compared.
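
Two of these treatments, the median position on a dimension and the correlation between two subjects, can be sketched briefly. The product-moment correlation is used here as one reasonable reading of "a correlation"; the function names and the five-statement sorts are hypothetical and far smaller than a real Q set.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def q_similarity(sort_a, sort_b):
    """Correlate the pile numbers two subjects received on the same statements."""
    ids = sorted(sort_a)
    return pearson([sort_a[i] for i in ids], [sort_b[i] for i in ids])


def dimension_score(sort, statement_ids):
    """Median pile position of the statements keyed to one dimension."""
    return median(sort[i] for i in statement_ids)


# Hypothetical five-statement sorts into piles 1..3 (1 = most descriptive).
subject_a = {"s1": 1, "s2": 2, "s3": 2, "s4": 2, "s5": 3}
subject_b = {"s1": 3, "s2": 2, "s3": 2, "s4": 2, "s5": 1}

print(q_similarity(subject_a, subject_a))   # 1.0
print(q_similarity(subject_a, subject_b))   # -1.0
print(dimension_score(subject_a, ["s1", "s2", "s5"]))   # 2
```

Since every sort uses the same forced distribution, the two sets of pile numbers have identical means and variances, and the correlation depends only on which statements were placed where.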

The choice among rating techniques depends upon the purpose of rating,
the qualifications of raters, the information they have about the subjects, and
the likelihood of distortion, deliberate or unconscious. The short, unsubtle
but carefully prepared descriptive graphic rating scale is probably best when
each subject is rated by different individuals and one may assume a
reasonable degree of honesty in rating. Ranking is advantageous when a single
judge gives information on the complete group or a representative sample of
the group. The forced choice is often superior when ratings are used for
institutional decisions regarding selection or classification but is less suitable
for guidance or description of the individual. The Q sort is of greatest value
where a comprehensive description of a single individual is desired and the
rater can be expected to give patient consideration to a long list of questions.
Asking the rater to fill out a standard personality questionnaire so that the
responses describe the subject has similar advantages.

1. Which rating technique would be most suitable for each of these purposes?

a. Obtaining ratings from principals to be used in deciding which teachers 
should receive salary increases for special merit 

b. Obtaining information for school records regarding parents’ impressions of 
their children's personalities 

c. Maintaining weekly records of ward behavior of patients as seen by at¬ 
tendants 

d. Recording teacher characteristics as judged by an observer in research 
evaluating teacher-training methods 

e. Obtaining reports from supervisors of student teachers, to be used by 
campus instructors in helping the student to improve 

f. Obtaining reports on pupils to be used in awarding college scholarships to 
the most deserving graduates in a state. 

2. Why might keenly alive children have more behavior problems than those rated
as wide-awake (Figure 86)? Would “keenly alive and alert” ordinarily be
considered a sign of poor mental hygiene?

3. In the American Council rating scale, the trait scale for leadership (C) is defined
by five specific phrases. What advantage does this scale have over the set of
adjectives “excellent,” “good,” “average,” “poor,” “unsatisfactory”?

4. Why might integrity and kindness be especially hard to rate reliably?

5. Which of the following traits would probably be hardest to rate reliably after
observations: skill in self-expression, freedom from tension, freedom from
anxiety, leadership (Hollingworth, 1922, p. 32)?

6. Ratings on leadership made at Officer Candidate School correlated only .15
with ratings on efficiency of combat leadership by superior officers who observed
the men in combat (Jenkins, 1947). Why is the correlation so low?

7. Could a complex description be obtained by having the rater mark the MMPI
responses that fit the subject? Would such a method be less satisfactory than a
Q sort?

8. The rating form shown in Figure 88 is used by high schools to send reports to
colleges. Compare this form with the ACE scale (Figure 85) with respect to

a. format

b. traits covered

c. adequacy of phrasing of scale positions

Validity of Ratings by Superiors. It is extremely difficult to state whether, in
a given situation, ratings by superiors will be valid measures of behavior.
One might expect supervisors to rate job knowledge accurately. The trait
is well defined, the behavior is observable, and the supervisor has ample
opportunity to observe. Nonetheless, supervisors’ ratings of job knowledge
usually correlate only about .35 with the knowledge measured by a formal
test, though ratings in one department reached a validity of .55 (Peters and
Campbell, 1955; Morsh and Schmid, 1956).

Another study investigated ratings given by department heads to foremen.
These ratings correlated only .22 with objective records of the work
performance of the crews. The rating supposedly reflected productivity but it
actually correlated .59 with how long the rater had known the foreman, and .65
with his liking for the foreman (Stockford and Bissell, 1949). Such findings
are particularly distressing in view of the widespread use of ratings as criteria
for validating tests.

Although the evidence demands that one be suspicious of the validity of
ratings, they are sometimes excellent sources of data. Jack (1934) found that
ratings of “ascendance” by nursery-school teachers correlated .81 with a score
derived from objectively recorded observations of the child’s acts on the
playground. For ratings to be depended upon, the validity of the rating
procedure should be established in the particular situation where it is used.

Peer Ratings 

In many situations ratings by peers give more useful information than
ratings by superiors. Even where ratings by superiors are available and
dependable, the peer ratings cover a different aspect of personality. A “peer”
is an individual who has the same status within the organization as the
person rated. Black’s study in which girls rated others living in the same college
dormitory is one example. Another is the rating of each other by officer
candidates. In military studies, such reports are often referred to as “buddy
ratings.”

Whereas only one or two superiors know a subject well, ten to thirty
raters may give information when ratings in a class or a dormitory are collected.
As a consequence, the average rating on any trait is highly reliable. Indeed,
for well-defined traits in a group which has had reasonable opportunity to
become acquainted, composite peer ratings generally have reliabilities in
the neighborhood of .90.
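
The gain from pooling many raters can be illustrated with the Spearman-Brown formula, which is not named in this passage but is the usual way to estimate the reliability of an average of several comparable ratings. The single-rater reliability of .30 used below is a hypothetical figure chosen for illustration, not a finding reported here.

```python
def composite_reliability(single_rater_r, n_raters):
    """Spearman-Brown estimate of the reliability of an average of n ratings."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


# Hypothetical single-rater reliability of .30, pooled over more and more peers.
for k in (1, 2, 10, 20, 30):
    print(k, round(composite_reliability(0.30, k), 2))
```

Under this assumption twenty raters already bring the composite to about .90, which is consistent with the figure quoted above for groups of ten to thirty.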

A child who impresses his peers as being a leader may not be the one
whom the teacher regards as a leader; the peers, for example, may place
great weight on popularity whereas the teacher notices originality and
initiative. It is of value for the teacher or counselor, however, to know which
persons are regarded as leaders by their own group. Indeed, the information
may be most significant in just those cases where the superiors and the peers
have different impressions. The peer rating is an objective statement about
the individual’s reputation. Reputation is based to some extent on behavior,
but the social pattern and role relations in the group introduce biases of
various sorts. Among adolescents, correlations between reputations and careful
observations of corresponding behaviors range from .45 to .70 (Newman and
Jones, 1946).

To obtain peer ratings it is usually necessary to simplify the task. Raters
are untrained, and we desire each rater to describe many individuals. The
adjective checklist (see p. 477) can be marked much more quickly than the
descriptive graphic scale and can cover many aspects of behavior. In using
the checklist to obtain information about particular individuals, related
adjectives are classified into groups and a count is made of the frequency with
which adjectives in each category are checked. Such a checklist leads to a
descriptive profile.

Nomination Techniques. If thirty persons in a group rate each other on
twenty traits, each person is being asked to give 600 responses. This means
that considerable carelessness and halo effect may be expected, and various
alternative devices are employed to reduce the labor without reducing the
amount of significant information. The most important of these devices is
the nomination technique. Each member of the group is asked to name a
fixed number of persons who are outstanding in a particular respect, such as
leadership. A similar nomination of persons who are most lacking in
leadership may also be solicited, but this arouses anxiety because subjects know
that they are being considered for such unfavorable nominations and
because, as raters, they are reluctant to speak unfavorably of associates. The
data gatherer can usually infer that the person who is never mentioned for a
certain favorable trait belongs somewhere toward the other end of the scale.

For young children, Hartshorne and May disguised the nomination
technique as a guessing game. The “Guess Who” test describes various roles
children may play, and each member of the group names the children he thinks
each description fits. Typical descriptions are (Hartshorne and May, 1929,
p. 88):

Here is the class athlete. He (or she) can play baseball, basketball, tennis, can
swim as well as any, and is a good sport.

This one is always picking on others and annoying them.

A profile for each child is made by counting the frequency with which he is
mentioned for each description.
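
The counting that produces the profile is simple enough to sketch. The function name, the ballot format, the description labels, and the children's names below are all invented for illustration; counting a rater at most once per child per description is likewise an assumption.

```python
from collections import Counter, defaultdict


def guess_who_profiles(ballots):
    """Count how often each child is named for each description.

    `ballots` is a list with one dict per rater, mapping a description
    label to the list of classmates that rater named for it.
    Returns {child: Counter({description: mentions})}.
    """
    profiles = defaultdict(Counter)
    for ballot in ballots:
        for description, names in ballot.items():
            for name in set(names):      # a rater counts once per child
                profiles[name][description] += 1
    return profiles


# Hypothetical ballots from three raters; names and labels are invented.
ballots = [
    {"class athlete": ["Pat", "Lee"], "picks on others": ["Sam"]},
    {"class athlete": ["Pat"], "picks on others": ["Sam", "Lee"]},
    {"class athlete": ["Pat"], "picks on others": []},
]
profiles = guess_who_profiles(ballots)
print(dict(profiles["Pat"]))   # {'class athlete': 3}
print(dict(profiles["Lee"]))   # {'class athlete': 1, 'picks on others': 1}
```

A child who appears in no ballot simply has an empty profile, which is itself informative, as the preceding section notes.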

Sociometric Ratings. The sociogram is a method of studying the social
structure of groups. Characteristics of an individual, including his popularity,
may be studied by the Guess Who method, but the sociogram gives further
insight by identifying cliques, hierarchies of leadership, and other social
groupings. The sociogram was developed by Moreno (1934). Although the
technique has been amended in various ways which sacrifice effectiveness
for convenience, the best procedure is to request members of a group to
indicate their choices for companions in a particular activity. Gronlund (1959)
suggests the following directions for use with pupils in the upper elementary
grades:

During the next few weeks we will be changing our seats around, working in
small groups, and playing some group games. Now that we all know each other by
name, you can help me arrange groups that work and play best together. You can
do this by writing the names of the children you would like to have sit near you,
to have work with you, and to have play with you. You may choose anyone in
this room you wish, including those pupils who are absent. Your choices will not
be seen by anyone else. Give first name and initial of last name.

Make your choices carefully so the groups will be the way you really want
them. I will try to arrange the groups so that each pupil gets at least two of his
choices. Sometimes it is hard to give everyone his first few choices so be sure to
make all five choices for each question.

Directions should be concerned with real group activities, and the choices
should be real choices. The data are not obtained in a test setting; instead,
they are obtained as a means of dealing with the group. If data are obtained
from a less real question, such as “Who are your friends?” there is more
likelihood of answers given to make a good impression. Subjects must know that
their reports will be treated confidentially. The sociometric data should be
used as promised to set up work groups, committees, homeroom seating, or
whatever; this permits one to obtain cooperation when the technique is used
again at a later date.

Though sociometric ratings are easy to obtain in most situations, the tester
must be wary of arousing anxieties. In a group of adolescent girls where
popularity is a matter of great concern, a girl may resist the injunction to
indicate the one person she most prefers, or may worry about how she will be
rated.

After the choices are obtained, they are plotted in a sociogram. Figure 89
is the sociogram of a class of fourth-grade girls early in the school year.
Pupils indicated one to three choices and were permitted to list also any
classmates they would not choose. This sociogram shows several typical
configurations. There are two groups or cliques. In one Emily is the most-sought-after
person, with Jane, Lenora, Caroline, Rhoda, and Louise as accepted
members. In the other group, Agnes is the key figure, with Lurline, Patricia, and
Ann as members. Patricia is not thoroughly integrated with the clique; while
accepted by Agnes, she is also reaching toward Emily in the other group,
rather than Lurline or Ann. Agnes, who might be a popular leader of all the
girls, instead shows considerable hostility, rejecting three popular girls. Ella
is not chosen by any of the others, and Tess is even more isolated.
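
Before a sociogram is drawn, the choices can be tabulated to find the most-chosen members, the mutual pairs, and the members nobody chose. The sketch below is one way to do that tabulation; the function name and the six-member group are hypothetical, and rejections are left out for brevity.

```python
from collections import Counter


def summarize_choices(choices):
    """Summarize sociometric choices given as {chooser: [chosen, ...]}.

    Returns (times_chosen, mutual_pairs, unchosen):
      times_chosen - Counter of how often each member is chosen
      mutual_pairs - sorted pairs of members who chose each other
      unchosen     - sorted members whom nobody chose
    """
    members = set(choices)
    for chosen in choices.values():
        members.update(chosen)
    times_chosen = Counter(c for chosen in choices.values() for c in set(chosen))
    mutual_pairs = sorted(
        (a, b)
        for a in choices
        for b in choices[a]
        if a < b and a in choices.get(b, [])
    )
    unchosen = sorted(m for m in members if times_chosen[m] == 0)
    return times_chosen, mutual_pairs, unchosen


# Hypothetical choices in a six-member group; the names are invented.
choices = {
    "Ava": ["Bea", "Cara"],
    "Bea": ["Ava", "Cara"],
    "Cara": ["Ava"],
    "Dot": ["Eve", "Ava"],
    "Eve": ["Dot"],
    "Fay": ["Ava", "Bea"],
}
times_chosen, mutual_pairs, unchosen = summarize_choices(choices)
print(times_chosen.most_common(1))   # [('Ava', 4)]
print(mutual_pairs)                  # [('Ava', 'Bea'), ('Ava', 'Cara'), ('Dot', 'Eve')]
print(unchosen)                      # ['Fay']
```

The tabulation identifies the "star" and the isolate, but the plotted diagram is still needed to see cliques and the links between them.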
FIG. 89. Sociogram for a class of fourth-grade girls. (Adapted from Staff, Division on Child
Development, 1945, p. 297.)

The sociogram obtained depends upon the question asked. For example,
if sorority girls are asked to indicate their choices for roommates, and then
choices of persons with whom to study, the sociometric patterns will differ.
A best friend may be thought of as too noisy or untidy for a good roommate,
and an unpopular girl may be regarded as an excellent helper on school
assignments. Basic social configurations are fairly stable when different
questions are used, but one cannot assume that the interpersonal structure of a
group is the same under all conditions.

The structure changes with time. By December, Agnes was a “star” in her
class, along with Rhoda and Emily. The cliques had disappeared, thanks to
the skill of the teacher. Ann and Lurline still chose each other, but Agnes
now turned her back on them, ignoring Ann and rejecting Lurline. Even
though social relationships change, an individual’s level of popularity is
remarkably constant. Gronlund (1959) points out that among elementary
pupils the stability over a one-year interval is about as high as for intelligence
and achievement.

The term sociometric rating applies generally to all methods of identifying
social relationships among group members. No sharp distinction is to be
made between the descriptive peer rating and the sociometric rating, but in
general the latter is restricted to questions about whom the rater likes best
or would prefer to work with, and thus is as much concerned with the rater’s
reaction as with the ratee’s personality. When given willingly, however, the
ratings may be much more dependable than other sorts of ratings, as
Lindzey and Borgatta point out (Lindzey, 1954, Vol. I, p. 406):

There is no need to train raters to engage in sociometric ratings. The
difficult and time-consuming task of attempting to produce common
frames of reference and homogeneous criteria in terms of which ratings
shall be assigned is avoided. The rater is asked to apply exactly those
particular, unique, and sometimes irrational criteria he has spent a
lifetime developing. Everyone is an experienced or expert rater when it
comes to sociometric judgments. Each of us has a vast body of
experience in deciding with whom we wish to interact and whom we wish to
avoid. Liking and disliking, accepting and rejecting are part of the
process of daily living. . . . One might say that the individual who uses
these techniques is taking advantage of the largest pool of sensitive and
experienced raters that is anywhere available.

The validity of responses to the sociometric questionnaire is attested by the
finding that choices given by pupils as to preferred fellow actors in a class
play correlated about .80 with actual choices when an opportunity to present
impromptu plays was given (Byrd, 1951).

9. What children besides Tess and Ella are fringers?

10. What interpretations of Agnes’ hostility can be suggested?

11. Prior to this study, the teacher had characterized Tess as hard working,
interested in accomplishing tasks, “fits in nicely with the group.” Tess helps others
with their sewing, at which she is superior. How would the teacher’s outlook
and treatment of Tess be affected by the information from the sociogram?

12. The following choices were made in a group of tenth-graders. Plot a sociogram
and discuss the interactions shown.

Shirley chooses Charles, Jim, and Sam.

Charles chooses Shirley, Sam, and Jim; rejects Tom.

Phil chooses Jim, Charles, and Shirley; rejects Wallace and Tom.

Wallace chooses Phil and Jack; rejects Tom.

Jim chooses Jack, Sam, Charles, and Shirley; rejects Tom.

Jack chooses Jim and Tom; rejects Phil.

Shirley is chosen by several girls whom she does not mention. Sam and Tom
were absent.

13. When sociograms were made of squadrons of Navy fliers on combat duty, it
was found that the “administratively designated leaders” were often not the
ones chosen as preferred work leaders by the men (G. A. Kelly, 1947, p. 133).
What practical suggestions follow from this finding?

14. In a group of sorority girls, the sociometric question “Whom would you choose
as a roommate?” is asked. Will results be the same if the question is changed
to “With whom would you choose to go on a double date?”

15. What would be the best way of studying the reliability of sociometric data?

Uses of Peer Ratings. Information about the individual’s reputation or status
in the group can be used in many ways. The group leader uses it to identify
individuals who require special attention and individuals who can be
developed into leaders. Sociometric information has frequently been of use in
reorganizing a group so that it will function better. For example,
Roethlisberger and Dickson used sociometric data to organize factory workers into
congenial teams. Moreno, in an institution for delinquent girls, assigned the girls
to small living groups on the basis of sociometric choices.

Peer ratings can be used as a basis for selection and classification. Among
officer candidates, for example, the impression a man makes on his
companions during the early phases of training is likely to forecast his ability to win
confidence and acceptance as an officer. One study (Wherry and Fryer,
1949) refers to the peer rating as the “purest measure of leadership . . .
better than any other variable.” Kelly and Fiske (1951, p. 169) found that
peer ratings of clinical psychology trainees after only a few days’ close
association were significant predictors of ratings of clinical competence made by
university departments three years later. The median correlation of .25 is
only a little below the coefficient of .34 for ratings by a team of trained
psychological assessors using full test and interview data. Neither validity
coefficient is high, partly because of the inadequacy of the criteria. Similarly,
composite peer ratings of officer candidates correlate about .50 with later
ratings by superiors in duty assignments. This correlation is extremely
impressive, in view of the criterion reliability of about .50. The rated traits which
predict the criterion include cooperative, emotionally stable, assertive,
intellectual, and determined (Tupes, 1957).
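
Why a correlation of .50 is impressive against a criterion whose own reliability is only .50 can be shown with the standard correction for attenuation, applied here to the criterion alone. The formula is not stated in this passage and is brought in as an aid to the reasoning; the function name is hypothetical.

```python
from math import sqrt


def correct_for_criterion_unreliability(observed_r, criterion_reliability):
    """Estimate the correlation with a perfectly reliable criterion."""
    return observed_r / sqrt(criterion_reliability)


# Figures from the text: r of about .50 against a criterion of reliability .50.
print(round(correct_for_criterion_unreliability(0.50, 0.50), 2))   # 0.71
```

On this reckoning the peer ratings would correlate about .71 with an error-free version of the superiors' ratings, and no predictor could be expected to correlate much above .71 with the criterion as actually measured.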

For the counselor, the peer descriptions point out characteristics of the
individual which impede his acceptance. Especially when the student seems
to lack insight regarding his reputation, the peer rating points to behavior
which should be examined during counseling.

16. Cattell and Stice (1953) find that surgency (i.e., energetic, talkative,
enthusiastic behavior) correlates very little with leadership behavior as rated by
observers, but correlates substantially with frequency of election. Explain this
finding. What does it imply regarding the use of peer ratings as criteria?

Noteworthy Rating Scales 

Few rating scales are distributed commercially, since the common
practice is to develop a new instrument for each institution or each investigation.
Some scales have been carefully designed and standardized for common
research and administrative purposes. Figures 85 and 88 show scales for use by
high schools in making recommendations about college applicants. Several
scales for industrial merit rating have also been published or distributed
through management consulting firms. Here we shall examine selected scales
for two other uses.

The rating of personality has particularly widespread application. Both for
practical personnel decisions and for research, we wish to record the
impressions of peers and supervisors. A suitable set of scales will include traits
which can be rated reliably and which are relatively definite and free from
halo effect.

Scales for ward observations and clinical diagnosis emphasize symptoms.
In large hospitals, for example, it is useful for ward attendants to fill out such
forms periodically on each patient, since shifts in observed behavior may
imply that the patient’s treatment should be altered. Several scales for rating
patient behavior have been developed, one being the Wittenborn Psychiatric
Rating Scales published by the Psychological Corporation (1955). (See
also Lorr et al., 1955.) The Wittenborn form presents 52 scales, organized
into nine scores representing different types of symptoms. Figure 90 shows
the ratings of a patient on the first five items. The clusters I to IX were
defined through factor analysis, and the unshaded block in the rating form
indicates that a particular response is relevant to one of the dimensions. The
rater need only circle the number (0, 1, 2, or 3) which indicates his
impression of the patient. The scorer then copies that number into the
corresponding block. For example, the rating given on scale 1, indicating difficulty in
sleeping, adds one point to the score in cluster I and cluster V. Two
unshaded blocks in a given column indicate that double weight is given to an
item.
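
The scoring step just described (copy each circled rating into the blocks for
its clusters, doubling it where an item carries double weight) can be sketched
in a few lines of Python. This is only an illustration: the function name and
the scoring key are hypothetical, and the only key entry taken from the text is
that scale 1 contributes to clusters I and V. The published form defines the
real key.

```python
from collections import defaultdict

# Hypothetical scoring key: item number -> list of (cluster, weight).
# Only the entry for item 1 (clusters I and V) comes from the text;
# the entries for items 2 and 3 are invented for illustration.
SCORING_KEY = {
    1: [("I", 1), ("V", 1)],
    2: [("IV", 1)],
    3: [("III", 2)],  # weight 2 stands for an item given double weight
}


def score_clusters(ratings, key=SCORING_KEY):
    """Sum weighted item ratings (each 0, 1, 2, or 3) into cluster scores."""
    totals = defaultdict(int)
    for item, rating in ratings.items():
        if rating not in (0, 1, 2, 3):
            raise ValueError(f"item {item}: rating must be 0, 1, 2, or 3")
        for cluster, weight in key.get(item, []):
            totals[cluster] += weight * rating
    return dict(totals)


if __name__ == "__main__":
    # A rating of 1 on item 1 adds one point to cluster I and to cluster V.
    print(score_clusters({1: 1, 2: 0, 3: 2}))
```

Keeping the key as data rather than as program logic mirrors the printed form,
where the pattern of unshaded blocks, not the rater, determines which cluster a
rating feeds.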

The dimensions, established by examining symptoms in a large number
of patients, are named in psychiatric terminology: I. Acute anxiety; II.
Conversion hysteria; III. Manic state; IV. Depressed state; V. Schizophrenic
excitement; VI. Paranoid condition; VII. Paranoid schizophrenic; VIII.
Hebephrenic schizophrenic; IX. Phobic compulsive. With few exceptions (e.g.,
Acute anxiety and Phobic compulsive) the correlations between scales are
low. The scores are moderately reliable, the median split-half correlation
being .82. Combining two or more independent ratings would be necessary to
get a dependable picture of the individual’s condition. This is not a serious
limitation, since clinical decisions are likely to be based on trends over
several weeks. It should be noted that the rating scale indicates the patient’s
condition rather than his diagnostic category. A patient may shift through
different patterns of behavior as he progresses toward recovery or is
temporarily upset, and the function of the scale is to record such changes rather
than to give him a label.

FIG. 90. Ratings of a 26-year-old male patient on the Wittenborn Psychiatric
Rating Scales. (Form copyright ©, 1955, The Psychological Corporation.
Reproduced by permission.)

As a further example of rating scale development, we may mention the
Fels Parent Behavior Rating Scales. These scales were developed to study
the preschool child’s family. Trained observers visited each home
periodically and wrote a descriptive report; to provide systematic data which could
be treated statistically, the observer also gave ratings on thirty scales. These
scales were designed to cover emotional relations, disciplinary methods, and
values of the home.

The directions and definitions for one scale, for example, are as follows:

Quantity of Suggestion (Suggesting - Non-suggesting)

Rate the parent’s tendency to make suggestions to the child. Is the parent
constantly offering requests, commands, hints, or other attempts to direct the child’s
immediate behavior? Or does the parent withhold suggestions, giving the child’s
initiative full sway?

This does not apply to routine regulations and their enforcement. Rate only
where there is opportunity for suggestion. Note that “suggestion” is defined
broadly, including direct and indirect, positive and negative, verbal and
nonverbal, mandatory and optional.

- Parent continually attempting to direct the minute details of the child’s
routine functioning, and “free” play as well.

- Occasionally withholds suggestions, but more often indicates what to do next
or how to do it.

- Parent’s tendency to allow child’s initiative full scope is about equal to
tendency to interfere by making suggestions.

- Makes general suggestions now and then, but allows child large measure of
freedom to do things own way.

- Parent not only consistently avoids volunteering suggestions, but tends to
withhold them when they are requested, or when they are the obvious
reaction to the immediate situation.

Such lengthy scales, requiring patient and thoughtful discrimination,
contrast markedly with the simple rating scales used in obtaining
recommendations on prospective employees or routine judgments from teachers. The
elaborate definitions and fine subdivision of traits permit a much more
reliable and comprehensive picture of the home than a simple form would.
Interrater reliability on single traits ranges from about .50 to .90. The technique
is designed to be used by a qualified professional observer who has a
substantial amount of information to record, information which could not be
communicated fully in a coarse scale. Such an elaborate scale is unnecessary when
the rater has only casual impressions to convey.

The scale is organized around these factors: Warmth (first five variables
in Figure 91), Adjustment, Restrictiveness, Clarity of Policy, and
Interference. The detailed picture of five areas which are themselves further
differentiated forestalls any tendency to characterize homes as simply “good” or
“bad.” The scale results show that democratic homes can be cold and
conflictful, orderly yet affectionate, or warm and still maladjusted. The profile
in Figure 91 illustrates the volume of information recorded in quantitative
form about a single home. The Stones are warm, protective, rather coercive,
giving, as the authors say, a picture of “restrictive indulgence.” But more can
be learned from the profile (Baldwin et al., 1949, p. 29; see also Baldwin
et al., 1945):

The rating of readiness of enforcement contributes a definite flavor to the
interpretation. The mother is restrictive, but lax in enforcement and also, we see,
mild in her punishments. The home begins to appear verbal and nagging, but
without any core of enforced disciplinary policy. When ratings of low adjustment,
high discord, low effectiveness of policy, and high disciplinary friction are added,
the suspicion arises that Ted does not conform to his mother’s standards. She talks
and nags, but achieves little. The fact that approval is still high in spite of all this
conflict and discord might be interpreted as a determined effort by the mother to
see the boy in the best possible light. A low rating on understanding makes
it clear that the mother has little insight into what Ted wants and needs but
instead is projecting on him her own motivation.

FIG. 91. Rating of the home treatment of Ted Stone. (Baldwin et al., 1949, p. 28.)

This rating pattern fits the full clinical description given by the visitor.
This evidence of validity raises the question whether careful and elaborate
ratings cannot accomplish all that a clinical description might hope to do. An
attempt to reduce individuality to a limited list of dimensions always loses
some idiosyncratic features, however. The ratings characterize the Stone
family in terms of those qualities on which all homes can be judged, but not
in terms of its own recurrent themes and conditions. From the clinical notes
we learn facts such as these: Mrs. Stone has had lifelong trouble in forming
emotional ties, with the significant exceptions of her mother and her son. She
is contemptuous of her husband. She thinks that no one, not even her
husband, understands that she has “sacrificed her life for Ted.” “Having
identified herself completely with her product, it was necessary that the child
himself be immaculate, perfect in behavior, precocious intellectually.” Ted is
prone to respiratory infections and subject to allergies; these intensify his
mother’s anxiety. Discipline is pulled in opposite directions by Mrs. Stone’s
desire for perfection and her identification with Ted. “On one occasion when
he was sent to bed an hour early as a punishment, Mrs. Stone decided she
had been overly severe and went to the bedroom to read to him for the extra
period.” Such descriptive color and texture, while of no use for statistical
research, is informative both to the clinical worker and to the research
psychologist. There is no reason to think that even elaborate rating systems can
replace descriptive accounts in exploratory research or casework.

17. Would an elaborate rating instrument such as the Fels scales be advantageous
when obtaining ratings of a child by his teacher?

18. How much of the “individual” information quoted from the caseworker’s notes
on the Stones could have been covered by adding additional traits to the
rating scale?

19. No evidence on agreement between raters is given in the manual for the
Wittenborn scales. Why is this information needed? Plan a study to obtain it.

OBSERVATION OF BEHAVIOR SAMPLES 

Self-reports and judgments by peers and supervisors are based on a more or
less haphazard composite of observations. The rater has not seen the
individual in all situations, and selective recall operates in both rating and
self-report. Systematic observations can give a more accurate description of
typical behavior. A distinction must be made between observations intended to
cover a representative sample of behavior and observations in a standardized
test situation. The former attempts to estimate typical behavior from a
statistically representative sample of situations actually occurring in life; the
situations may and usually will differ for different persons. The latter observes
reaction to the same situation for everyone. The situation used may be quite
uncommon in the subject’s life.

Field observations (i.e., observation of the subject under his normal
circumstances) are relatively easy to carry out and for some studies they are
more suitable than standardized observations. Many investigators feel that
it is impossible to know personality unless we watch the subject react to the
conditions that are most significant for him. Different stimuli are significant
for different people. Standard situations are perhaps not as likely to elicit the
important behavior patterns as are the normal (dissimilar) conditions
under which the subjects live. The difficulty lies in seeing enough of the
person’s normal behavior and in obtaining dependable records.

Sampling Problems 

Whenever one wishes to know the typical behavior of an employee, a
student, or a patient, the most direct way to find out is to observe him in
normal situations. If he does not know that we are watching him, we obtain a
truthful picture limited only by our skill as observers and our persistence.
This is our usual basis for judging associates and friends. Judgments based
upon observation, however, are likely to be untrustworthy on account of
sampling errors and observer errors.

To know the “typical” behavior of an individual, it is necessary to know
how he characteristically acts in a particular situation. But situations change
from day to day and from moment to moment. If we observe the
attentiveness of an employee before lunch, we get a different impression from the one
we would get in midafternoon. If we observe cheerfulness or politeness when
he is worried, our impression may be unfair. The only way to be even
moderately certain of typical behavior is to study the subject on many occasions,
which is expensive. In practice, one must compromise between perfect
sampling and economy.

Inference about individual differences is difficult because one can never
observe two individuals in the same situation. Even when the situation is
externally constant, previous conditions cause people to behave differently.
When Jimmy fidgets more than John in the classroom, one is likely to infer
that Jimmy is “restless,” “nervous,” or “jumpy.” If the impression is confirmed
by repeated observations, this difference seems to be fundamental. But if
Jimmy usually comes to school without breakfast, if he expects to be
criticized by the teacher for poor work, or if he is large for the chairs provided,
the difference in activity may tell nothing about the boys’ basic restlessness.
In fact, if conditions were reversed, Johnny might be more restless than
Jimmy is now. At best, comparative observations show how different people
act under their present conditions, but do not guarantee that the differences
would persist if background conditions changed.

One of the best approaches to precise comparison is time sampling. In
time sampling, a set schedule of observations is planned in advance. The
schedule is randomized so that each subject is seen under comparable
conditions. In one study of social contacts of preschool children, for example,
a schedule of one-minute observations was drawn up. After the observer
watched a child for one minute, noting all social interaction, he wrote down
a full record. Children were watched in a predetermined order which was
altered from day to day. During the study each child was observed an equal
number of times during the first five minutes of the free-play hour, during
the second five minutes, and so on (Barker et al., 1943, pp. 509-525).

Edward - 12/6/28 - DST

AB - Plays with and mauls Paul. Teacher intervenes. 60"
CD - Goes up to Alma. Throws cover at her. 11"
EF - Jungle gym. 100"
GH - Somersaults. 3"
IJ - Slide. 30"
KL - In closet. 50"
MN - Knocks Paul down, teacher intervenes. 2"

FIG. 92. Record obtained in a five-minute observation of a preschool child (Thomas et al.,
1929, p. 43). The diagram shows the nursery school play yard and traces Edward’s
movements. Letters mark the start and finish of each activity.
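
One way to build a schedule of this kind, in which the order of observation
changes daily and every child is seen equally often in every time slot, is to
rotate a randomly chosen starting order. The Python sketch below is offered
only as an illustration of the balancing idea. The function name, the
children's names, and the rotation method are assumptions, not the procedure
the original investigators report.

```python
import random


def balanced_schedule(children, n_days, seed=None):
    """Return one observation order per day.

    Position k in a day's list is the k-th observation slot of the period.
    The order is a random starting order rotated one step per day, so over
    any len(children) consecutive rotations each child occupies each slot
    exactly once. The days are then shuffled so the rotation is not obvious.
    """
    rng = random.Random(seed)
    base = list(children)
    rng.shuffle(base)  # random starting order
    n = len(base)
    days = []
    for d in range(n_days):
        shift = d % n
        days.append(base[shift:] + base[:shift])  # rotate the order
    rng.shuffle(days)  # randomize which rotation falls on which day
    return days


if __name__ == "__main__":
    # Hypothetical group of four children observed on eight days.
    for day, order in enumerate(balanced_schedule(
            ["Ann", "Ben", "Cal", "Dee"], n_days=8, seed=1), start=1):
        print(day, order)
```

The balance is exact only when the number of days is a multiple of the number
of children. Otherwise some children fall short by one observation in some
slots.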

The advantage of short well-distributed time samples is that the
cumulative picture is likely to be far more typical than an equal amount of evidence
obtained in a few longer observations. Moreover, errors of memory are
reduced, since the observer can make full notes during or just after observing.
Time samples are especially suitable for recording specific facts that can be
expressed numerically, such as the number of social contacts with other
children. A slightly more elaborate record shows the complete activity pattern
during the observation period. Figure 92 reports the behavior of Edward
during a five-minute period: what he did, for how long, and where.

A much more extensive sample is obtained in the “day record” technique
of Barker and his associates. Their general aim is to study the pattern of a
child’s life, with particular attention to the various settings in which he
moves. For example, they wish to see what behaviors are evoked from
small-town children in the course of a day, as compared to those evoked from
children in larger communities. For this purpose, an observer goes with the child
throughout his whole day’s program, from the moment of awakening until
the end of the day. The form of the record is illustrated by this description
of three boys playing with a crate in a vacant lot (Barker and Wright, 1951,
pp. 349-350):

5:39 Raymond tilted the crate from side to side in a calm, rhythmical way.

Clifford’s feet were endangered again. Stewart came over and very
protectively led Clifford out of the way. [Observer’s opinion.]

Raymond slowly descended to the ground inside the crate.
When Stewart came back around the crate, Raymond reached out at him,
and growled very gutturally, and said, “I’m a big gorilla.” Growling very
ferociously, he stamped around the “cage” with his arms hanging loosely.
He reached out with slow, gross movements.

Raymond reached toward Clifford but didn’t really try to catch him.
Then he grabbed Stewart by the shirt.

Imitating a very fierce gorilla, he pulled Stewart toward the crate.
Stewart was passive and allowed himself to be pulled in. He said, “Why
don’t you let go of me?” He spoke disgustedly and yet not disparagingly.
Raymond released his grasp and ceased imitating a gorilla.
He tilted the crate so that he could crawl out of the open end. As he
crawled out, he lost control of the crate and it fell over on its side with
the open end perpendicular to the ground.
Stewart said, “Well, how did you get out?”

Raymond said self-consciously, “I fell out,” and forced a laugh.
He looked briefly at me as if wondering what I thought.

5:40 He slowly and carefully crawled inside and went directly through the crate
and out the open end.

Stewart and Clifford got in front of Raymond and tried to get him to chase
them and continue imitating a gorilla.

Raymond stood immobile and didn’t cooperate.
Finally Stewart said to Clifford, “Maybe if he’ll follow us through, then we
can crawl out this end. Then we can tip it up and have him caught again.”

The day record has some advantage in showing the total sequence of
activities, but it also has disadvantages. The unconcealed observer may have
some effect upon behavior, an effect which cannot be assessed. And the use
of a single full day does not obtain completely typical information for the
child in question, since each day has its own unique characteristics. Neither
of these is a serious drawback for the Barker studies, where the goal is an
overall report of the normal experience of a group of children. Observing
many children, each on a different day, irons out sampling error so far as
group data are concerned. The observer is present in all the data and
therefore does not prevent comparisons among groups.

A series of time samples gives at best a statistical composite of different
responses. Responses which the observer counts as the same may actually
have quite different meanings, and situations which appear similar to him
may evoke quite different responses. Newcomb (1929) observed boys in a
summer camp, making daily records of many particular responses such as
cooperation in after-meal work, fighting with other boys, and persistence.
When these day-to-day records were studied, most boys were found to be
inconsistent. As Newcomb points out, situations are only superficially alike.
“Whether or not Johnny engages in a fight may depend on whether or not he
thinks he can ‘lick’ his opponent.” The apparent inconsistency of Johnny’s
action from an observer’s frame of reference may be highly consistent from
Johnny’s point of view. Correlations were computed between observed
behaviors which were supposed to represent single traits, such as showing off
or dominance over peers. Behaviors grouped within one of these supposed
traits correlated little higher than obviously dissimilar behaviors. Another
study found only trivial correlations (median .20) between punctuality
observed in different situations (Dudycha, 1936).

Conclusions formed in one situation, even on the basis of many
cumulated observations, are valid only for that situation. Inference as to how a
person would act in another situation is warranted only when responses
under the two conditions have been shown to be correlated, or when
observation yields so much understanding of his underlying personality structure
(his stimulus equivalences) that we can see what a new situation means to
him.

Symonds (1931, p. 5) has commented emphatically on the need for ade¬ 
quate sampling: 

A single observation is unreliable, a single rating is unreliable, a single
test is unreliable, a single measurement is unreliable, a single answer to
a question is unreliable. Reliability is achieved by keeping up
observations, ratings, tests, questions, measures. . . . If you ask one teacher for
her judgment of a boy’s trustworthiness, you obtain what she has been
able to observe in those few narrow classroom situations that appeared
when her attention was particularly directed to some act involving
honesty. An adequate rating, on the other hand, requires the judgment of
several raters in several situations at several different times. Reliable
evidence is multiplied evidence.

The extreme variation in performance is illustrated by a study of
navigators. Students were taken on missions where their task was to continually
compute their own position, air speed, etc., by dead reckoning. On each
mission, four separate legs were run, and the accuracy of the man’s air speed
report for the leg was recorded. The score for each mission had a split-half
reliability of .77; this is an indication of the man’s consistency from one leg to
the next, under the same wind conditions, with the same plane, etc. A
correlation was also computed between scores on different days. While this
correlation varied from class to class, the mean reliability coefficient was .00
(Carter and Dudek, 1947). Differences in score are determined almost entirely
by transient conditions rather than by the individual’s ability. Under these
circumstances, even combining information from several missions would not
give a useful report on the individual.
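
Why pooling missions cannot rescue a day-to-day correlation of .00 can be seen
from the Spearman-Brown formula, a standard psychometric result that this
passage does not itself state. If single measurements intercorrelate r, the
mean of k parallel measurements has projected reliability kr / (1 + (k - 1)r).
The short Python sketch below applies the formula to the two coefficients in
the navigator study. The function name is ours, and the assumption of parallel
measurements is an idealization.

```python
def spearman_brown(r_single, k):
    """Projected reliability of the mean of k parallel measurements,
    each pair of which correlates r_single (Spearman-Brown formula)."""
    return k * r_single / (1 + (k - 1) * r_single)


if __name__ == "__main__":
    # .77: consistency between halves of one mission (same day, same plane).
    # .00: mean correlation between scores on different days.
    for r in (0.77, 0.00):
        projected = [round(spearman_brown(r, k), 2) for k in (1, 2, 4, 8)]
        print(f"r = {r:.2f}: k = 1, 2, 4, 8 ->", projected)
```

When r is zero the numerator is zero for every k, so no number of additional
missions raises the projected reliability. When r is substantial, as within a
single mission, added measurements help quickly.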

How many observations are required to obtain reliable data depends on
the problem. The experimenter can estimate reliability of sampling by
correlating ratings of “odd” with “even” observations. By this means, it was
determined that 24 or more five-minute time samples permitted “reasonably
stable” estimates of individual differences in preschool children (Arrington,
1939). In general, many short observations are superior to a few longer
samples of behavior.
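
The odd-even check mentioned above can be sketched as follows: for each child,
total the scores from the odd-numbered observations and from the even-numbered
observations, then correlate the two sets of totals across children. The data
layout, function names, and scores in this Python illustration are invented
and do not come from the Arrington study.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_correlation(records):
    """records maps each child to scores for observations 1, 2, 3, ...

    Observations 1, 3, 5, ... form the "odd" half and 2, 4, 6, ... the
    "even" half; the halves are totaled per child and then correlated.
    """
    odd_totals = [sum(scores[0::2]) for scores in records.values()]
    even_totals = [sum(scores[1::2]) for scores in records.values()]
    return pearson(odd_totals, even_totals)


if __name__ == "__main__":
    # Hypothetical counts of social contacts in six time samples per child.
    records = {
        "child_1": [4, 5, 3, 4, 6, 5],
        "child_2": [1, 0, 2, 1, 1, 2],
        "child_3": [2, 3, 2, 4, 3, 3],
        "child_4": [6, 4, 5, 6, 7, 5],
    }
    print(round(odd_even_correlation(records), 2))
```

Splitting by odd and even observations, rather than first half against second
half, keeps both halves spread over the whole study period, so a gradual change
in the children affects the two halves alike.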

20. Why might an unfair picture of a child’s behavior be obtained if he were
always observed during the first five minutes of the play period and never
during the second five minutes?

21. What sorts of information about Edward's personality could be obtained from 
a cumulation of records such as Figure 92? 

22. What sorts of information about Edward’s behavior are discarded in making 
an objective record such as Figure 92? 

23. As a criterion in selection research, would it be better to test every flier with
repeated landings on the same day or with a similar number of landings spread
over several days?

24. May one say, paraphrasing Symonds, “Valid evidence is multiplied evidence”?

Observer Error

Whenever a person observes an event, he notices some happenings and
ignores others. This is a necessary difficulty, since any activity has too many
aspects for the mind to attend to all at once. Especially in social situations,
the complexity of interaction prevents exhaustive reporting. If errors in
observing were merely random omissions, they would be unimportant. But
observers make systematic errors, overemphasizing some types of happenings
and failing to report others.

Viewing the identical scene, observers give widely different reports. The
following reports were written by four observers, each of whom saw the
same motion-picture scene of about ten minutes duration (from the film
This Is Robert). The film was shown twice without sound. The film
sequence, taken in the classroom and on the playground, showed several
activities which revealed much of Robert’s personality. The observers were
directed to note everything they could about one boy, Robert, and were told
to use parentheses to set apart inferences or interpretations. Numbers in
these accounts, referring to scenes in the film, have been inserted to aid
comparison.

Observer A. (2) Robert reads word by word, using finger to follow place.
(4) Observes girl in box with much preoccupation. (5) During singing, he in
general doesn’t participate too actively. Interest is part of time centered elsewhere.
Appears to respond most actively to sections of song involving action. Has
tendency for seemingly meaningless movement. Twitching of fingers, aimless thrusts
with arms.

Observer B. (1) Looked at camera upon entering (seemed perplexed and
interested). Smiled at camera. (2) Reads (with apparent interest and with a fair
degree of facility). (3) Active in roughhouse play with girls. (4) Upon being
kicked (unintentionally) by one girl he responded (angrily). (5) Talked with
girl sitting next to him between singing periods. Participated in singing. (At times
appeared enthusiastic.) Didn’t always sing with others. (6) Participated in a
dispute in a game with others (appeared to stand up for his own rights). Aggressive
behavior toward another boy. Turned pockets inside out while talking to teacher
and other students. (7) Put on overshoes without assistance. Climbed to top of
ladder rungs. Tried to get rung which was occupied by a girl but since she didn’t
give in, contented himself with another place.

Observer C. (1) Smiles into camera (curious). When group breaks up, he
makes nervous gestures, throws arm out into air. (2) Attention to reading lesson.
Reads with serious look on his face, has to use line marker. (3) Chases girls, teases.
(4) Girl kicks when he puts hand on her leg. Robert makes face at her. (5)
Singing. Sits with mouth open, knocks knees together, scratches leg, puts fingers in
mouth (seems to have several nervous habits, though not emotionally overwrought
or self-conscious). (6) In a dispute over parchesi, he stands up for his rights.
(7) Short dispute because he wants rung on jungle gym.

Observer D. (2) Uses guide to follow words, reads slowly, fairly forced and
with careful formation of sounds (perhaps unsure of self and fearful of mistakes).
(3) Perhaps slightly aggressive as evidenced by pushing younger child to side
when moving from a position to another. Plays with other children with obvious
enjoyment, smiles, runs, seems especially associated with girls. This is noticeable
in games and in seating in singing. (5) Takes little interest in singing, fidgets,
moves hands and legs (perhaps shy and nervous). Seems in song to be unfamiliar
with words of main part, and shows disinterest by fidgeting and twisting around.
Not until chorus is reached does he pick up interest. His especial friend seems to
be a particular girl, as he is always seated by her.

Every observer is more sensitive to some types of behavior than others.
How does he regard nailbiting, failure to look one in the eye, or profanity?
If he considers these significant, he will note them and base his impression
on them. In the same situation, another observer might give greatest
attention to voice modulation, careful use of grammar, or friendliness of
conversation. Ideally, an observer would base his impression on every revealing act,
but when he is looking for one thing, he necessarily overlooks something
else.

Observers interpret what they see. If observers recorded only objective
facts, others studying the data might reach quite different interpretations,
but people always try to give meanings to what they see. When they make
an interpretation, they tend to overlook facts which do not fit the
interpretation, and may even invent facts needed to complete the event as
interpreted.

25. What do you think really happened in scene (4)? Which observer came closest
to adequate reporting of it?

26. Which of the numbered scenes appears to give the most significant information 
about Robert? How many of the observers reported that information? 

27. Did the observers of the film about Robert succeed in identifying and marking 
all their judgments and hypotheses? 

28. Do the observers of the film about Robert ever disagree, or are the differences 
entirely due to omissions and oversights? 

29. A clinical psychologist asks a parent how well his 6-year-old child gets along
with other children. Illustrate how each of the following errors might operate:

a. The observer has not observed an adequate sample for judging typical
behavior.

b. The observer notices events which fit his preconceived notions.

c. The observer is likely to note the behaviors he considers significant and to
ignore others of equal importance.

d. The observer may give a faulty interpretation to an event.

Systematic Recording. Where possible, it is desirable to record countable
units of behavior. For example, the extent to which factory workers attend
to their work may be described by a time record which notes the exact
moments when they are at work, and the time spent in looking around,
obtaining tools, and visiting. The causes of distractions can also be noted. Such
records for different workers and departments can be analyzed both for
judging the workers and for planning rest periods or improved tool
distribution.
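
A time record of this kind reduces readily to the share of observed time spent
in each activity. The Python sketch below shows one possible tabulation. The
record layout (activity, start minute, end minute), the function name, and the
sample entries are hypothetical; the activity names are those mentioned in the
paragraph above.

```python
from collections import defaultdict


def time_shares(record):
    """Return the fraction of observed time spent in each activity.

    record is a list of (activity, start_minute, end_minute) entries.
    """
    totals = defaultdict(float)
    for activity, start, end in record:
        if end < start:
            raise ValueError(f"{activity}: end precedes start")
        totals[activity] += end - start
    observed = sum(totals.values())
    if observed == 0:
        return {}
    return {activity: t / observed for activity, t in totals.items()}


if __name__ == "__main__":
    # Invented one-hour record for a single worker.
    record = [
        ("at work", 0, 30),
        ("obtaining tools", 30, 35),
        ("at work", 35, 50),
        ("visiting", 50, 60),
    ]
    for activity, share in time_shares(record).items():
        print(f"{activity}: {share:.0%}")
```

Because the shares are proportions, records of unequal length from different
workers or departments can be set side by side.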

Child development has been studied through records of social contacts,
play activities, speech, and other objectively defined behaviors (Barker et
al., 1943, pp. 509 ff.; Thomas, 1929). Such precise reports are especially useful for
measuring changes, since the observer’s memory cannot compare
performance now with performance several months ago.

Even if the behavior observed is too varied for direct tabulation, it may be
possible to define categories of actions so that the observer needs only to
check each incident as it occurs. One of the best examples is Bales’ method
(1951) of categorizing social interaction for the purposes of research on
small groups. Twelve categories describing various types of response are
defined, including:

Shows solidarity, raises other’s status, gives help, reward
Shows tension release, jokes, laughs, shows satisfaction
Asks for orientation, information, repetition, confirmation
Disagrees, shows passive rejection, formality, withholds help

The observer tallies responses moment by moment. By noting who makes
each remark and its approximate time, he can keep a full record of the
interplay of thought and emotion. An “interaction recorder” using a motor-driven
tape has been designed to facilitate such recording. Later analysis can
examine individual differences such as the emergence of conflict and other
group processes.
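
A moment-by-moment tally of this sort amounts to counting entries that each
carry a time, a speaker, and a category. The Python sketch below is an
illustration only: it uses shortened labels for the four categories quoted
above (the method defines twelve), and the function name, speakers, and events
are invented.

```python
from collections import Counter

# Shortened labels for the four categories quoted in the text; the full
# method defines twelve, which are not reproduced here.
CATEGORIES = {
    "shows solidarity",
    "shows tension release",
    "asks for orientation",
    "disagrees",
}


def tally(events):
    """Count interaction events by category and by (speaker, category).

    events is a list of (minute, speaker, category) entries in time order.
    """
    by_category = Counter()
    by_speaker = Counter()
    for minute, speaker, category in events:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        by_category[category] += 1
        by_speaker[(speaker, category)] += 1
    return by_category, by_speaker


if __name__ == "__main__":
    # Invented record of a short discussion among members A, B, and C.
    events = [
        (0, "A", "asks for orientation"),
        (1, "B", "disagrees"),
        (1, "A", "disagrees"),
        (2, "C", "shows tension release"),
        (3, "B", "shows solidarity"),
    ]
    by_category, by_speaker = tally(events)
    print(by_category.most_common())
    print(by_speaker[("A", "disagrees")])
```

Keeping the time and speaker with each entry, rather than only running totals,
leaves the sequence of remarks available for later analysis.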

30. What advantages and disadvantages would a checklist or schedule have for
each of the following purposes, compared to a one-paragraph descriptive
report?

a. A social agency wishes its visitor to report the condition of homes of its
clients, including furnishings, conveniences, and neatness.

b. A department store sends shoppers to be served by its clerks and to
observe their procedure and manner.

c. A state requires an observation of the applicant’s driving before issuing a
license to drive.

31. An investigator wishes to measure punctuality, for research purposes. He
stations himself where he can observe the arrival of each student attending a
particular class. Number of minutes early or late is recorded for each person.
Records are made on several days. What assumptions are involved in using
the average of these records as an index of punctuality?

32. Tape recordings of group discussions are used to study individual differences
in dominance, leadership, and other traits. What types of observable
information about personality could not be obtained from the tape?

Anecdotal Records. Although objective counts and tabulations are well
suited for research, their information is of limited value for individual
guidance. Anecdotal records escape the bleakness of quantitative methods,
offering a more lifelike sketch of the subject. The observer is free to note any
behavior that appears significant, rather than having to concentrate on the
same traits for all subjects. Often the anecdotes are reports of incidents noted
by a teacher or supervisor in daily contacts.

In an anecdotal record, the observer describes exactly what he observed,
keeping interpretation and fact separate. The record is made as soon after
the action as possible, to eliminate errors of recall. Cumulated over a period
of time, the incidents provide a richer picture of behavior than any other
equally simple technique. The following are typical anecdotal reports.

Paul, after projecting the film for the class, took it back to the office (where I
happened to be) to rewind. He is not very skilled, and missed his timing, so that
much of the film cascaded onto the floor instead of going onto the takeup reel.
John came up just then and said something sarcastic about Paul’s clumsiness. Paul
gave no answer, but kept on at work with no change of manner and a stolid face.
Richard, who had been watching Paul, turned on John, told him to “shut up and
give Paul a chance,” and muttered something about “some of these kids make me
sick.” (Paul seems to suppress emotion; he certainly heard John’s very unpleasant
tone.)

Joan spent the entire science period wandering from group to group instead of
helping Rose as she was expected to. She interrupted many of the others, telling
them they were doing the work wrong. She asked a lot of (foolish) questions
(“Does filter paper make certain things go through or just keep certain things
out?”) and was teased a good deal by the boys. By the time Rose was finished she
returned; Rose was quite angry, but they made up and Joan helped put things
away. But on her first trip to the storeroom she stayed to plate a gold ring with
mercury, while Rose made repeated trips with the equipment.

The reporter has two responsibilities: he must select incidents worth
reporting, and he must be objective. Both incidents characteristic of the
person and striking exceptions to his normal conduct are helpful. The typical
incidents provide a more individualized picture than the hackneyed trait
names that would otherwise be used (friendly, showing initiative, rude, and
so on). Exceptional actions are rarely reported in ratings and general
impressions, but they too are significant. A single incident showing interest in the
company’s welfare from a man known as a troublemaker or a sign of enthusiasm
for learning on the part of a boy who rebels against school may be the
key to a new and successful treatment. The observer must weed out value
judgments and interpretations, attempting to report the exact occurrences,
including significant preceding events and environmental conditions. One
can never report “everything” about the incident. The reporter selects for his
record the facts he considers relevant.

Single anecdotes tell little. As anecdotes accumulate, however, they begin
to fill in a picture of the person’s habits. If a particular response is typical, it
will recur. An effective method of determining personality characteristics of
an individual is to search through the anecdotes about him to detect
repetitions. A summary based on these recurring patterns usually requires
confirmation by further directed observation.


Suggested Readings 

Biber, Barbara E., & others. Recording spontaneous behaviors. Life and ways of
the seven-year-old. New York: Basic Books, 1952. Pp. 33-53.
An account of procedures used in ten-minute schoolroom observations,
together with illustrative anecdotal records and evidence of observer reliability.

Gronlund, Norman E. Validity of sociometric results. Sociometry in the classroom.
New York: Harper, 1959. Pp. 158-188.
A review of studies shows how sociometric choices of school children relate
to observed behavior, teacher opinions, and adjustment.

Lindzey, Gardner, & Borgatta, Edgar F. Sociometric measurement. In Lindzey
(ed.), Handbook of social psychology, Vol. I. Cambridge: Addison-Wesley,
1954. Pp. 405-448.
A comprehensive summary of the major sociometric techniques includes
correlations between sociometric evidence and other measures of personality.
The authors draw particular attention to limitations of research or practical
decisions based on sociometric findings alone.

Newman, Frances B. The development of methods in the adolescent growth study.
In Frances B. Newman and Harold E. Jones, The adolescent in social groups,
Appl. Psychol. Monogr., 1946, No. 9, 16-29.
This describes different techniques, ranging from quantitative ratings to
narrative accounts, used in the same program for observing adolescent
personality. Special advantages of each approach are indicated. Subsequent chapters
give information on reliability and validity, and the use of the data in case
analysis.

Prescott, Daniel A. Interpreting behavior. The child in the educative process.
New York: McGraw-Hill, 1957. Pp. 99-150.
Anecdotal records collected on one boy throughout a school year are
compared to show consistencies and deviations from his normal pattern. The
discussion shows how teachers form and test hypotheses when using such
records as a case-study technique. Several other chapters in the book also give
useful information on the collection of anecdotal information.

Tuddenham, Read D. Studies in reputation. II. The diagnosis of social adjustment.
Psychol. Monogr., 1952, 66, No. 1.
Reports on school children obtained by the nomination technique can be used
for personality analysis. Five illustrative records are interpreted.


Performance Tests of Personality


THE preceding chapters have considered the well-established and
unquestionably useful techniques for studying personality: interest measures for use
in counseling, adjustment inventories for screening purposes, sociometric
and peer ratings, empirically keyed predictive questionnaires, and
systematic sampling of behavior. Some of these are of value to the practicing
personnel psychologist and others are better suited to gathering research data,
but each of them is capable of giving reliable data, the major sources of
error in interpretation have been identified, and score interpretations are
adequately supported by combined evidence and theory.
We now turn to procedures whose value is unsettled; indeed, it is
vehemently disputed. Although performance tests and projective techniques
have been in use for about thirty years, they have reached a much less
mature stage of development than methods discussed to this point. The
complexity of personality and the instability of personality theory are one source
of difficulty. When there is no consensus as to the most important traits to
measure, or even, as Allport says (Lindzey, 1958), on whether it is fruitful to
conceive of personality in terms of traits, test developers have no target on
which to concentrate. Conversely, when there are no outstanding tests in an
area, research is scattered so widely that no coherent body of evidence
becomes available as a base on which to build theory. When the theory of
intelligence was confused and primitive in the first quarter of this century, the
focus Binet’s scale gave to research effort made it possible to move toward a
much clearer theory regarding the nature and growth of ability.

Since there are no performance tests of salient importance and few clear
principles, we must confine ourselves in this chapter to describing enough
illustrative tests to show the range of approaches. In addition to a variety of
performance tests, this chapter describes projective tests, there being no
sharp distinction between the two classes. We shall present some evidence
on the validity of scores as psychometric predictors, that is, on the use of
performance and projective tests as quantitative measuring instruments. Many
of these instruments, however, are used primarily for impressionistic
assessment in which the scores become raw material to be integrated with other
data into a portrait of the whole person. The complex and controversial
issues regarding such assessment we reserve for Chapter 19.

The aim of the performance test may be clarified by contrasting it with
the time-sampling method of determining typical behavior by observation.
The limitations of time sampling are its high cost and the fact that scores,
when obtained, depend upon both the subject and the situations in which he
happens to have been observed. It has been the great dream of personality
measurement to invent procedures which would give quantitative results,
would directly represent behavior rather than biased impressions, and would
permit direct comparison of individuals in the same situation.

A performance test is an observation in a standard situation designed to
elicit a particular type of response. One trait of great interest, for example,
is how the person controls and expresses aggression. It is difficult to judge
this by observing the daily life of most subjects, because the person is only
occasionally in an aggression-provoking situation. Testers have therefore
developed standardized procedures for annoying the subject, such as
condemning the opinions he voices in a standard interview. This almost certainly
does arouse hostile feelings, and the subject’s behavior in this short test may
be more revealing than several hours of field observation.

Galton once compared psychological testing to the geologist’s “sinking
shafts at critical points” to obtain samples of significant material. Whereas
ratings return, for the most part, strictly surface impressions, and the time
sample sinks its shafts entirely at random, the performance test is designed
to provoke exhibitions of truly critical behavior. The usual features of such a
test are as follows:

• The stimulus situation is made as nearly uniform as possible for all
subjects.

• The situation is designed to permit variation in those types of behavior
which the tester wishes to observe.

• The subject is led to believe that one characteristic is being tested while
the observer is actually observing some other aspect of performance.

• The observer makes careful records of the subject’s method of
performance, rather than noting only the amount performed.

An example is the Operational Stress technique developed to assess
whether men entering pilot training can resist pressure. During
administration of an apparatus test, the candidate is subjected to stress-producing
stimuli. The apparatus has seven controls (pedal, throttle, stick, and various
levers), which the examinee resets continually as signal lights change and
buzzers sound. The time required to react to each signal is recorded
electrically. The examinee is told that he will be observed by a concealed
observer “just as a checkpilot will rate you in flying.” Administration is
standardized: one minute of rest and anticipation, one minute of directions
regarding signals and controls, and three short test periods. In each period,
the examinee is given increasingly reproving “stress directions” while he is
busily moving levers. The signals are made more complex. In test period C,
the pattern of lights changes six times, after intervals of about fifteen
seconds, while the examiner is delivering the following speech in an urgent
manner: “Don’t make lights flicker on and off. Be steady. . . . Quit making
errors. You aren’t moving fast enough. . . . More speed. . . . Hurry and
stop the clock. Last chance. . . . Set controls quickly. . . . You are
still making errors.” The concealed observer, meanwhile, makes extensive
ratings of manner and reaction to criticism, and objective clock scores are
recorded (Guilford, 1947, pp. 660-664; Melton, 1947, pp. 811-814).

The greatest advantage of a test observation is that it reveals
characteristics which appear only infrequently in normal activities, characteristics
such as bravery, reaction to frustration, and dishonesty. Second, desire to
make a good impression does not invalidate the test. In fact, just because he
is anxious to make a good impression, the subject reveals more than he
normally would. It is necessary, however, to take this motivation into account in
interpreting results. The third advantage of the performance test is that it
comes closer than other techniques to comparing subjects under identical
conditions.

Performance tests vary greatly in purpose and in design. They may be
strictly psychometric instruments measuring single narrowly defined
constructs such as persistence in routine work, or they may be a basis for
impressionistic evaluation of the person’s total life-style. They may be worksamples
for predicting success in a specific assignment, or cross sections of behavior
without reference to any single future situation.

Situations used to elicit performance range from highly structured to
almost totally unstructured. A situation is structured if it has for all subjects a
definite meaning. An unstructured situation presents so few cues or has so
little pattern that the subject can give it almost any meaning he wishes. A
commonplace unstructured stimulus is the strange sound in the night. Is it the wind?
a burglar? the cat? water dripping? The interpretation we make is strongly
influenced by our interests, by fears conscious and unconscious, and, of
course, by knowledge. In a structured situation, the subject knows exactly
what he is expected to do and how he is expected to do it. In the
unstructured situation, he guides himself. The more ambiguous the situation,
the more opportunity there is for individual method of interpretation and
performance. An extremely unstructured situation is established in
Waehner’s (1946) procedure for studying personality. She observed her subject’s
behavior and products after turning him loose in a studio equipped with all
types of art media and materials, with little more instruction than “You may
do anything you like with these.”

Highly structured tasks are excellent for measuring ability just because
they force everyone to try the same thing. Projective situations, at the
opposite extreme, use almost totally unstructured stimuli. The projective test is
so named because it permits the subject to project into the situation his
unconscious thoughts, wishes, and fears (L. K. Frank, 1939). Thus the
householder who interprets the creak in the dark as a burglar may be revealing
that he is more anxious than another man, who interprets the same stimulus
as a natural phenomenon and goes back to sleep.

1. In each of the following situations, discuss whether it would be preferable to
employ observations in natural conditions, or standardized observations where
conditions are fixed in advance and identical for all subjects.

a. The telephone company wishes to rate its operators on courtesy and clarity
of speech. It is able to tap conversations and make recordings.

b. It is desired to screen Navy personnel for tendency to panic under conditions
of extreme noise, as in amphibious landings.

c. An investigator wishes to study the habitual recklessness of 7-year-old boys
in climbing and jumping.

2. To what extent may each of the following be considered an unstructured 
stimulus? 

a. A teacher, during a test, glances up from her desk and barely observes a 
hasty movement of one boy who is pulling his hand into his lap from the 
aisle 

b. A group of people play duplicate bridge, the same set of hands being played 
at each table 

c. A questionnaire is designed to obtain information about age, income,
education, etc. All possible answers are anticipated and presented on the blank
in multiple-choice form.

3. To what extent is each of the following unstructured? If the test is at all
unstructured, discuss whether that is an advantage or a disadvantage.

a. Stanford-Binet test, Memory for Sentences.

b. Wechsler Comprehension test.

c. A test of addition which presents in random order the combinations up to
9 + 9, the pupil being directed to do as many items as he can in the time
allowed.

d. In the Porteus test (Figure 1), the subject is to solve a maze. The time it takes
him to trace the correct path with his pencil is scored.


STRUCTURED TESTS MEASURING SINGLE TRAITS 
Character and Persistence

The Character Education Inquiry of Hartshorne and May (1928, 1929,
1930) was the only extended effort to evaluate personality by strictly
quantitative and objective methods. Character traits can be validly assessed only
by subjecting people to temptation in a situation where they believe they
can violate standards without detection. Traits studied by Hartshorne and
May include truthfulness, honesty with money, persistence,
cooperativeness, and generosity.

Honesty with money was tested by presenting arithmetic problems in
which each pupil had to use a boxful of coins. The box provided for each
pupil was secretly identified. At the end of the work each pupil carried his own
box to a pile in front of the room. Since pupils were unaware that boxes
could be identified, many took advantage of the opportunity to keep some of
the money. Honesty in a situation involving prestige was tested by asking
the child to do an impossible task, such as placing marks in small circles
while keeping his eyes closed (Figure 93). Many children turned in
“successful performances” which could have been obtained only by cheating.

CIRCLES PUZZLE
First Trial

Wait for the signal for each trial. Put the point of your pencil
on the cross at the foot of the oval. Then when the signal is given
shut your eyes and put a small cross or X in each circle, taking
them in order.

FIG. 93. An "improbable achievement" test of honesty (Hartshorne
and May, 1928, p. 62).

Motivation to work has been of particular interest because it is thought of
as the link between aptitude and achievement. The employer, the teacher,
the clinician, and all other users of tests wish they could predict whether a
person’s behavior will bear out the promise shown in tests of ability. Many
investigators have explored possible performance tests. Hartshorne and May
tested how long children persist as a task becomes difficult. Pupils read a
story which builds to a climax. “Again the terrible piercing shriek of the
whistle screamed at them. Charles could see the frightened face of the
engineer . . .” Here the examiner tells them that if they wish to learn the
ending they must read the difficult printed material that follows:

CHARLESLIFTEDLUCILLETOHISBACK“PUTYOURARMSTIGHTAROUNDMYNECKANDHOLDON

NoWhoWTogETBaCkoNthETREStle.HoWTOBRingTHaTTErrIFIEDBURDeNOFACHiLDuPtO

fiNALlytAp-taPCAMEARHYTHMontheBriDGeruNNingfeeTfeeTcomING

The pupil separates each word with a vertical mark as he deciphers it; the
amount deciphered is an index of persistent effort (Hartshorne and May,
1928, p. 292). Some other tests have determined how long the subject will
continue to work on an exceedingly difficult or impossible problem,
presented in a series with solvable problems.

4. If Bill does better than Fred on the circle-dotting test of honesty, what conclu¬ 
sions can be drawn about Bill's character? 

5. Joe is a known delinquent, having gotten into trouble together with a gang of 
boys for several minor thefts and disturbances. How can you explain the fact 
that he does well on all the tests of honesty, cooperation, and generosity? 

6. The test illustrated in Figure 94 is a test of self-control, requiring the pupil to
work in the presence of attractive distractions. Does this measure the same
motivational factors as the story-completion test of persistence?

Perceptual and Cognitive Styles 

The psychologist observing any problem-solving behavior is quickly
impressed by individual differences in the way subjects attack problems and
surmount difficulties. Individual tests such as the Wechsler are commonly
used as opportunities to observe such styles or habits (see p. 191).
Information of this type might predict what problems a person could best deal with
and what errors he would make, and might open the way to remedial
training to improve his thinking.

Tests of cognitive processes are concerned primarily with how a person
organizes information. This patterning has been the chief interest of Gestalt
psychology, and many of the tests in this area come from Wertheimer and
other Gestaltist investigators. Gestalt psychology has always been concerned
with the brain as an information-processing organ. If the organism is an
information-processing system, simple perceptual tests may identify its
constant characteristics. These tests are analogous to the test performed by an
electronic technician when he puts a perfect sine-wave signal into an
amplifier and examines the distorted signal put out by the speaker. This gives the
unique “signature” of that particular system.

Simple stimuli have been used to test the functioning of the total
perceptual system. Flicker fusion and apparent movement may be described as
examples. When a person views a light through spaces in a rotating shutter,
“flicker” is perceived at low rates of rotation. At high rates, however, no
interruption or flicker is noticed. As we increase the rate gradually, the subject
can report the point where the flicker just disappears. The “fusion
threshold” is fairly stable, and there are great individual differences. It has been
suggested that this fusion point provides an index of the ability of the
nervous system to register details of incoming stimulation, and measures of
flicker fusion have been found useful in diagnosis of brain damage. Halstead
(1951) comments that fusion “represents a dramatic change in consciousness
for the subject. For once he reaches the rate at which separate flashes . . .
fuse for him, he cannot tell the unsteady light from a steady one. He has
broken with physical reality. The rate is much higher in our normal
individuals than in our frontal brain-injured patients. It is as if the mental engine
were running in the brain-injured, but running on inadequate power. It fails
at the first little hill. . . . It seems clear that the test reflects an important
aspect of cerebral metabolism.”

When two lights, side by side, are flashed on and off in quick succession,
the light appears to jump back and forth. This “phi phenomenon,” as
Wertheimer called it, is the basis for traveling light patterns in theater marquees
and neon signs. In apparent movement the nervous system integrates
stimulation into a pattern. Klein and Schlesinger (1951) arranged to vary the
interval between flashes. As the interval is reduced there is a definite threshold
where apparent movement first appears and a second threshold where it
disappears, beyond which point the subject sees two steady lights. The
separation of these thresholds, i.e., the range of intervals which permit an
impression of movement, is much wider for some subjects than others.

Other tests deal with mental flexibility and rigidity. It is frequently
observed that unsuccessful problem solvers cling to incorrect ideas, for
example, repeatedly entering the same blind alley in a maze. Successful
adaptation requires reorganizing perceptual fields on the basis of new information
or new requirements. Investigators associated with Wertheimer in Berlin in
the 1920’s invented the “Water Jar” or Einstellung test (Luchins, 1942).
Einstellung may be translated approximately as “mental set” or “orientation.”
The test makes use of water-jar problems like those of the Stanford-Binet:
“If you have a 7-quart jar and a 4-quart jar, how can you get exactly 10
quarts of water?” To follow the logic of the test, call the jars A(7) and B(4).
The solution is: Fill A, fill B from A (leaving three quarts in A), empty B, fill
B from A, fill A. Thus three quarts are obtained by the rule (A - B), and the
additional seven quarts from A, i.e., (A - B) + A = 10. The series of
problems in a test might be as follows:


                  Jars             To Be
             A      B      C     Obtained    Rule

Example      7      4     --        10       2A - B
a           21    127      3       100       B - A - 2C
b           14    163     25        99       B - A - 2C
c           18     43     10         5       B - A - 2C
d            9     42      6        21       B - A - 2C
e           20     59      4        31       B - A - 2C
f           23     49      3        20       B - A - 2C or A - C    Critical
g           15     39      3        18       B - A - 2C or A + C    Critical
h           28     76      3        25       A - C                  Extinction
i           18     48      4        22       B - A - 2C or A + C    Critical
j           14     36      8         6       B - A - 2C or A - C    Critical


The subject may be given help in solving the first few problems. The long
series of problems a to e, solved by applying a particular rule, builds up a
mental set to use that formula. “Critical” and “extinction” problems are then
introduced. In a critical problem such as f, the “set” solution works but
there is a much easier way to achieve the answer. In the extinction problem
h the set solution does not work, but another simple rule can be found by the
flexible subject.
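
The classification of the problems in the series can be checked mechanically.
The following is only an illustrative sketch, not part of Luchins' procedure:
the names `solves` and `classify`, the label "set-building," and the coding
of each rule as a triple of multipliers for jars A, B, and C are assumptions
introduced here for the purpose.

```python
# Each problem: (A, B, C, quantity to be obtained), taken from the series above.
PROBLEMS = {
    "a": (21, 127, 3, 100),
    "b": (14, 163, 25, 99),
    "c": (18, 43, 10, 5),
    "d": (9, 42, 6, 21),
    "e": (20, 59, 4, 31),
    "f": (23, 49, 3, 20),
    "g": (15, 39, 3, 18),
    "h": (28, 76, 3, 25),
    "i": (18, 48, 4, 22),
    "j": (14, 36, 8, 6),
}

# A rule is coded (hypothetically) as multipliers applied to jars A, B, C.
SET_RULE = (-1, 1, -2)                                   # B - A - 2C
SIMPLE_RULES = {"A - C": (1, 0, -1), "A + C": (1, 0, 1)}


def solves(a, b, c, target, rule):
    """Return True if the rule yields exactly the required quantity."""
    ka, kb, kc = rule
    return ka * a + kb * b + kc * c == target


def classify(a, b, c, target):
    """Label a problem by which kinds of rule solve it."""
    set_works = solves(a, b, c, target, SET_RULE)
    simple_works = any(solves(a, b, c, target, r) for r in SIMPLE_RULES.values())
    if set_works and simple_works:
        return "critical"        # the set rule works, but a shorter way exists
    if set_works:
        return "set-building"    # only the long rule works
    if simple_works:
        return "extinction"      # the set rule fails; a simple rule works
    return "unclassified"


for label, problem in PROBLEMS.items():
    print(label, classify(*problem))
```

Coding the rules as multipliers keeps the check to simple arithmetic; it says
nothing about the pouring sequence a subject would actually carry out.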

To get a good score in the Einstellung test the subject must attend to the
immediate problem, discarding memories of the previous solutions. Inflexible
behavior perhaps indicates inability to separate conflicting sources of
information. Three other tests relevant to this specific aspect of mental
functioning may be described briefly.

The Embedded Figures test (EFT), based on work of Gottschaldt (1926),
presents a strange geometric pattern and requires the subject to find it in a
larger complex field. In some versions, the background is colored irregularly
to increase confusion. The score is the time required to solve the problems
(Witkin, 1950).

The Stroop Color Word test uses three test sheets. The first sheet consists
of color names to be read as fast as possible. The second sheet consists of
rows of dots whose colors are to be named rapidly. The third and most
important sheet again presents color names, but this time the words are printed
in color. The colors used conflict with the names, the word yellow being
printed in red, for example. The subject is required to call off the colors as
rapidly as possible. Finally, he is asked to read the words on this sheet. The
decline in speed from second to third trial (color naming) and from first to
fourth (reading) indicates the degree to which conflicting cues block his
thinking.
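
The two declines in speed can be made concrete with a small sketch. The
function name `stroop_declines`, the choice of items per second as the unit
of speed, and the timings in the example are assumptions made here for
illustration; the text does not specify how the decline was computed.

```python
def stroop_declines(items, t1, t2, t3, t4):
    """Speed declines (items per second) across the four Color Word trials.

    t1: seconds to read plain color names (sheet 1)
    t2: seconds to name the colors of dots (sheet 2)
    t3: seconds to name ink colors of conflicting words (sheet 3)
    t4: seconds to read the words on the conflicting sheet (sheet 3 again)
    All four trials are assumed to contain the same number of items.
    """
    if min(t1, t2, t3, t4) <= 0:
        raise ValueError("all times must be positive")
    s1, s2, s3, s4 = (items / t for t in (t1, t2, t3, t4))
    return {
        "color_naming_decline": s2 - s3,  # second trial versus third
        "reading_decline": s1 - s4,       # first trial versus fourth
    }


# Hypothetical subject: 100 items per sheet, times in seconds.
print(stroop_declines(100, t1=40, t2=50, t3=100, t4=50))
```

Taking each decline against the matching baseline trial is what separates
interference from sheer reading or naming speed.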

FIG. 95. Problems from an Embedded Figures test.

The Rod and Frame test of Witkin (1949) is similar in psychological
conception, though radically different in stimulus material. A person ordinarily
judges “which way is up” by combining visual and kinesthetic cues. A “crazy
house” in which walls and objects are built at a slant requires the subject to
disregard visual cues and rely wholly on bodily cues. In the Witkin
experiment in a dark room, the subject is strapped into a fixed position in a chair
which can be tilted. He is then asked to judge when an adjustable luminous
rod several feet in front of him is in the upright position. The person’s success
on the rod alone is a measure of kinesthetic acuity. When the rod is placed
within a luminous tilted square frame, however, visual cues tempt the
subject to call the rod vertical when it is parallel to the sides of the frame. Most
subjects judging the rod-in-frame select as vertical a compromise position
between the gravitational vertical and the tilt of the frame. The greater the tilt
of the chosen position, the less the subject has cast aside the irrelevant cue.
A second test of the same nature is Witkin’s Tilting Room, where the entire
chamber, in which the chair is mounted, can be tilted.
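
One way to express the compromise position numerically is as a fraction of
the frame's tilt. The sketch below is an assumption introduced here, not
Witkin's scoring procedure; the function name `frame_dependence` and the
angles in the example are hypothetical.

```python
def frame_dependence(rod_setting_deg, frame_tilt_deg):
    """Tilt of the chosen rod position as a fraction of the frame's tilt.

    Both angles are measured in degrees from the gravitational vertical,
    in the same direction. A value of 0 means the rod was set to the true
    vertical; 1 means it was set parallel to the sides of the frame.
    """
    if frame_tilt_deg == 0:
        raise ValueError("the frame must be tilted for the index to be defined")
    return rod_setting_deg / frame_tilt_deg


# Hypothetical trial: frame tilted 28 degrees, rod set 7 degrees off vertical.
print(frame_dependence(7, 28))
```

A higher fraction corresponds to greater reliance on the irrelevant visual cue.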

Validity of Structured Tests 

The performance tests have several common characteristics. Each is
employed to measure a stable personality trait. The investigator usually hopes
to obtain, from a measurement at one moment in time, information on the
individual’s general level of persistence, rigidity, prejudice, etc. Performance
tests have also been used to study how persistence, for example, varies over
time or with changing experimental conditions. In discussing field
observations we noted that numerous samples are required to estimate typical
behavior, because any particular observation catches the subject only in one of
many possible situations. By standardizing the situation, eliminating the
random variation from subject to subject, the performance test hopes to
make extensive sampling unnecessary.

Motivation can be more nearly standardized for the performance test than
for any other personality measure. In the field observation the subject is
given no directions, simply exhibiting whatever motivation he brings to his
own affairs. In EFT, the Circle test of honesty, and in nearly every other
performance test, the subject is told that an ability is being measured. This
provides him with a culturally defined ideal of behavior; he understands, in
each case, how he can earn a high score and understands that a high score is
desirable. The “good performance” referred to in the directions is not the
performance the investigator is observing; for example, the child who “raises
his score” on the Circle test by peeking lowers his honesty score. A
performance test of personality tries to standardize motivation to the same degree
that motivation for a test of mental ability is standardized. Motivation is not
uniform for every subject, but neither is it uniform in a school achievement
test or a test for selecting policemen.

Nearly all performance tests contain an ability component which is
irrelevant to the personality trait supposedly examined. Some control for level of
ability is therefore required. In the Color Word test the reading rate without
interference is a necessary baseline for the interference measurement.
General reasoning or spatial ability accounts for as much of Embedded Figures
performance as does difficulty in handling perceptual interference.
Separation of ability from personality factors in problem-solving tests is not easy,
and may not be reasonable to attempt. Embedded Figures correlates .35 to
.60 with ability tests such as Block Design, Number Series, and Thurstone’s
tests of the spatial factor. Yet these are problem-solving tests just because the
answer is “hidden.” If there were no interfering stimuli there would be no
problem.

Do perceptual and cognitive tests measure nothing but general ability?
This suspicion is readily dismissed. Although they overlap with ability tests,
particularly in samples of college students, they also carry information not
predictable from ability tests. An entire battery of intellectual, spatial, and
psychomotor tests accounted for only half of the reliable variance in
Embedded Figures in one study (Guilford, 1947, pp. 895, 897).
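
As a rough arithmetic gloss on such figures, and assuming the usual
convention that a squared correlation gives the proportion of variance two
measures share, correlations between Embedded Figures and single ability
tests in the range of .35 to .60 correspond to shared variance of roughly 12
to 36 per cent. The symbols below are introduced here for illustration and
are not the source's notation: R is the multiple correlation of an ability
battery with Embedded Figures, and r_xx is the reliability of Embedded
Figures.

```latex
r = .35 \;\Rightarrow\; r^{2} \approx .12,
\qquad
r = .60 \;\Rightarrow\; r^{2} = .36

\frac{R^{2}}{r_{xx}} \approx .50
\quad \text{(the battery accounts for about half of the reliable variance)}
```

On this reading, about half of the dependable individual differences in
Embedded Figures remain unexplained by the ability battery.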

Performance tests are closely linked to specific theories about personality
and brain functioning; in this they contrast with most questionnaires and
field observations. Hartshorne and May assumed that character consisted of
collections of responses or habits and sought to measure those habits by
sampling. The perceptual and cognitive tests assume that mental functioning has
a definite overall structure and seek to measure specific subprocesses. Since
each performance test relates to one narrow element in behavior, it gives
little basis for describing the overt personality as a whole. Structured tests of
broad traits like cheerfulness and friendliness do not exist and indeed are
difficult to imagine.

The most important question about structured tests is their degree of
generality. If honesty is a generalized habit, the Circle test will correlate with
honesty in many more significant situations. If inability to separate two
streams of information is a general pattern of behavior, such superficially
dissimilar tests as Color Word and Rod and Frame will correlate. And if so,
we can expect the same trait to be important in many types of problem
solving.

The questionnaire obtains information by means of general questions dealing
with average situations, for example, “Do you feel unhappy much of the
time?” The hope is that these superficial summary questions will permit
inferences to specific situations. The structured performance test, on the other
hand, starts with a single artificial situation and hopes that behavior in that
situation is largely determined by some fundamental quality of habit,
temperament, or brain structure which will influence response in situations
having a very different surface appearance. The test is meaningless unless it
measures something accurately. It is equally pointless to develop the test if
it measures behavior only in this specific task. If it correlates repeatedly
with tasks which are on the surface quite dissimilar, it takes on considerable
psychological interest. It can then be regarded as measuring a difference
between individuals which is in some sense fundamental. Once this is
established, a program of validation should be undertaken to determine just how
much of socially significant behavior such a score can account for.

The developers of performance tests must study the potential usefulness
of their methods in three stages: establishing that the test measures some
characteristic dependably (reliability studies), establishing that the
characteristic is found in distinctly different tasks or situations (generality),
and establishing that the characteristic is related to socially significant
aspects of behavior (criterion-oriented validity).

We cannot point to any one “official” version of a performance test, and we
have no manual systematically reporting its technical qualities. Each
investigator modifies a test for his own purposes and reports such findings about
reliability and validity as emerge from studies of his own theories. For this
reason, we can give only illustrative rather than definitive evaluations of the
tests we have described.

7. What “ability” is involved in the Rod and Frame test and how may it be
corrected for?

8. Are the tasks used in performance tests of personality any more “artificial”
than those used in aptitude tests?

9. Discuss this statement: “A test presented to the subject as a test of ability
should be regarded as a measure of ability if the subject knows what aspect of
his performance will actually be scored, and as a measure of personality if he
does not know.”

10. The Water Jar test tends to produce U-shaped distributions rather than normal
distributions. Is this an advantage or a disadvantage?


Reliability. A coefficient of equivalence for a performance test calls for
correlating different trials or stimulus arrangements on the same occasion.
Some performance tests are quite reliable and others are quite unreliable. A
low coefficient of equivalence may indicate that the proposed test is too brief
to be a good measure, or that there is no common characteristic running
through performance on different items. A high coefficient of course implies
that we are getting an accurate score for the individual’s standing at this
time. Table 67 illustrates some reported coefficients of equivalence.
Evidently one can measure many traits with very satisfactory precision, but
reliability cannot be taken for granted. Users of performance tests often neglect
to check reliabilities, and it is very likely that unreliability accounts for the
failure of many experimenters to discover significant correlations and
differences with such tests.
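
A coefficient of equivalence of the odd-even kind can be sketched in a few lines. The sketch below assumes that the "corrected" coefficients reported for such tests use the usual Spearman-Brown step-up from half-length to full-length; the text does not name the correction, and the function names and the data layout (one list of trial scores per subject) are illustrative assumptions only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r_part, factor=2.0):
    """Estimated reliability of a test `factor` times as long as the part."""
    return factor * r_part / (1.0 + (factor - 1.0) * r_part)

def split_half_reliability(trials):
    """trials: one list of trial scores per subject (hypothetical layout).

    Returns the raw odd-even correlation and the Spearman-Brown
    estimate for the full-length test.
    """
    odd = [sum(s[0::2]) for s in trials]    # trials 1, 3, 5, ...
    even = [sum(s[1::2]) for s in trials]   # trials 2, 4, 6, ...
    r_half = pearson(odd, even)
    return r_half, spearman_brown(r_half)
```

The same `spearman_brown` function, called with a factor other than 2, estimates what lengthening a test would do to its reliability, provided the added trials really are equivalent to the original ones.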

TABLE 67. Coefficients of Equivalence for Selected Performance Tests

Test                      Scores Compared              Coefficient   Remarks and Source
------------------------  ---------------------------  ------------  -----------------------------
Rod and Frame             Rod and Frame with Tilting   .64, .52      (Witkin, 1949)
                          Room
Rod and Frame             Eight odd vs. eight even     .99, .98      (Gardner et al., unpublished)
                          trials, corrected
Deception tests                                        .62 to .87    Scores corrected for
                                                                     differences in relevant
                                                                     abilities (Hartshorne and
                                                                     May, 1928)
Critical flicker          Thresholds on six trials     .98 in each   (Irvine, 1954)
frequency of paretics                                  group
and schizophrenics
Embedded Figures          Odd vs. even items,          .68 to .88    Subtests composed of
                          corrected                                  more-embedded figures
                                                                     correlate only .35 with
                                                                     easy figures (Gardner
                                                                     et al., unpublished)

Lengthening a performance test to improve reliability of measurement is
sometimes impossible. In the Water Jar test, for example, once the subject
discovers that the solution rule changes for different problems, subsequent
trials are likely to show much less “rigidity.” There is no way to reinstate the
subject’s initial naivete to obtain an equivalent trial. The only way to extend
the sample of behavior is to find a second task which measures the same
quality.

If the coefficient of equivalence is large enough to guarantee accurate
measurement, the coefficient of stability tells whether the quality being
measured is a stable one. Stability is desirable when we intend to interpret
the trait as a long-standing, generally significant aspect of personality. An
unstable score reflecting mood, temporary inefficiency of thinking, or the
like may also be useful, particularly for testing transient effects of
experimental conditions such as stress or drugs. The illustrative results in
Table 68 leave no doubt that some performance tests measure characteristics of
considerable permanence.

TABLE 68. Coefficients of Stability for Selected Performance Tests

Test             Interval                   Coefficient   Remarks
---------------  -------------------------  ------------  ----------------------------
Cheating         Six months                 .75-.79       (Hartshorne and May, 1928,
                                                          II, 88-89)
Cheating         Early adolescence to       .37           (V. Jones, 1946)
                 adulthood
Rod and Frame    More than one year         .86           (Witkin, 1949)

Correlations with Nonperformance Measures. Positive relations between
performance tests and other types of personality measures are considered
evidence that the tests are getting at personality variables. Pemberton (1952a)
administered several questionnaires to subjects who also took the Embedded
Figures test. There were 29 significant correlations between self-reports and
EFT, indicating that good performers on EFT describe themselves as
follows:

    I stay in the background.
    I am not sensitive to social undercurrents.
    I am not interested in humanitarian occupations.
    I do not feel need to apologize for wrong-doing.
    I am not conventional.
    I have high theoretical interests and values.
    I am interested in physical sciences.

It appears that in this sample, at least, good EFT performance is associated
with self-centeredness or nonconformity. We cannot judge the strength of
this relationship from Pemberton’s report.

L. Ainsworth (1958) correlated rigidity on the Water Jar test with a
questionnaire on insecurity or general life adjustment. For 120 students in a
British university, the correlation was .24. An important subsidiary finding was
that the performance of insecure students changed as the test was
administered with various degrees of emotional pressure. As stress increased, the
most insecure students actually showed greater adaptation on critical trials
than they had shown under minimum stress.

Generality. In one sense, every performance test appears to be highly
specific. Very slight modifications in conditions of administration, scoring
procedure, task, or sample produce significant differences in the meaning of
results. In the Tilting Room, errors have different correlations and different
psychological implications depending on whether the trial begins with both
chair and room tilted to the left or with the two tilted in opposite directions
(Witkin, 1949). This has a theoretical explanation, since the degree of
conflict between cue systems is far greater in the second case. In the Water Jar
test, behavior on critical trials (where the “set” solution works but is
unnecessarily roundabout) has different psychological properties and correlates
from behavior on extinction trials (where the “set” solution does not work)
(L. Ainsworth, 1958; Back, 1956).

Most of the performance tests of personality are used in many different
versions. One EFT presents key figures one at a time; another places several
key figures before the subject at once; in a third, the subject is to look for the
same key figure in every item. One may keep the key before the subject
during his search, or may require him to remember the key figure while
examining the complex figure. We need engineering studies of the promising tests
to determine what version is potentially most valuable. Even where
modification is desired, it appears necessary that the new form and a standard form
be used side by side to evaluate the importance of the change. No one would
think of adopting a modified Wechsler or Binet procedure without
comparing it systematically with the original, yet the tradition of standardization is
almost entirely neglected in performance testing of personality.

To summarize the research on generality is next to impossible. For some
traits such as “rigidity” there have been a dozen papers summarizing
correlations, and even the summaries are in disagreement. Some conclude that
there is a general trait of rigidity, some find three or four rigidity factors
(never the same from study to study), and some argue that the very
concept of rigidity is invalidated by the data. Levitt (1956) summarized more
than thirty correlations of the Water Jar test with other alleged measures of
rigidity and found that negative results outnumbered significant correlations
three to one. Such a finding must be regarded as evidence that investigators
have been too quick to label a test as a measure of rigidity. Among the tests
which have been claimed to measure rigidity were an anxiety
questionnaire, a questionnaire measure of rigidity, the California F scale, Wechsler
Similarities, and mirror writing of words. To expect all these measures to
agree reflects an unreasonably simple view of mental organization.

Luchins (1951), though he popularized the Water Jar test, is critical of
those who are content to measure a trait of “rigidity.” He regards the test as
an observation of mental process, and would expect its meaning to shift (as
Ainsworth found) under different conditions. Those who seek to measure an
abstract trait underlying the test performance

    err in assuming that every Einstellung solution to a test problem is
    brought about by the same psychological process, namely, rigidity of
    behavior. . . . Moreover, the alleged rigidity in solving the criticals is
    taken as an indication of rigidity in the respondent’s personality or of
    rigidity in his ego-defense system. His behavior is rigid because he
    possesses rigidity. One is reminded of the outmoded belief that a thing
    burns because it has fire in it. Rigidity of behavior is sought for in
    the respondent; it is considered as relatively independent of the field
    conditions under which the individual is operating.

    I do not think that there is anything inherently wrong with
    attempting to determine within a short period of time, a few hours of
    testing, the probability that an individual will shift his behavior in real
    life situations in order to meet changing circumstances. . . . At the
    present time the most fruitful approach seems to me to involve intensive
    observation of and experimentation with rigidity of behavior under
    various conditions, if possible suspending biases as to the nature of the
    behavior involved. The aim should be to vary conditions
    systematically and to observe what happens. As a final step, and not as a first
    step as is so common today, one may be able to propose an
    explanation for such behavior.

Devoted, painstaking exploration of a single type of test does not appeal
to investigators who want to isolate important individual differences. They
prefer to seek some general dimension among several diverse tests. They
are seriously attempting to identify a quality which is “in the person” yet
accounts for behavior over a wide range of situations or conditions. This line of
attack meets many setbacks; one attempt to establish generality after
another has failed. Yet the approach is not entirely without success, and
sufficiently patient revision of tests and theories has a reasonable chance of
identifying significant traits. Any consistent positive correlation between two
superficially dissimilar tests encourages continued effort to define and clarify
the underlying variable.

Positive correlations do commonly occur, but there are puzzling
inconsistencies. Witkin (1949) finds that EFT correlates with Rod and Frame .64 for
men but only .21 for women. Gardner (unpublished data) finds a
correlation of .65 for women but a near-zero correlation for men. Small samples,
differences in technique, and subtle differences in subject motivation all
contribute to these inconsistencies. One can conclude that performance tests
generally have correlations consistent with the theories offered to explain
them, but the correlations are often low. We are far from having the
reproducible high correlations which would permit us to argue that any two
performance scores measure the same factor of personality.
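
How much of such inconsistency small samples alone can produce is easy to gauge with a confidence interval based on Fisher's z transformation. The sketch below is illustrative only: the sample size of 30 is a hypothetical figure, since the sizes of the samples behind the coefficients quoted above are not given here, and the function name is our own.

```python
from math import atanh, tanh, sqrt

def correlation_interval(r, n, z_crit=1.96):
    """Approximate 95 percent confidence interval for a correlation.

    Uses Fisher's z transformation, whose standard error is
    1 / sqrt(n - 3). `n` is the number of subjects.
    """
    z = atanh(r)                 # transform r to the z scale
    se = 1.0 / sqrt(n - 3)       # standard error on the z scale
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# A correlation of .64 observed on a hypothetical sample of 30 subjects.
low, high = correlation_interval(0.64, 30)
```

With samples of this size the interval is wide, so two studies of the same relationship can report quite different coefficients without any real difference in what the tests measure.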

Each task has its specific elements, and a satisfactory general measure
will have to be built up by combining short trials on various tests each
containing the same common element along with different specifics. The
original Hartshorne-May studies of generality in character support this
conclusion. They were the first to cast doubt upon the assumption that general traits
of behavior can readily be measured by one or two specific samples. Although
the specific tests were reliable, different measures of deception correlated
little with each other. The correlation between cheating on a classroom test
and on the Circle test of coordination was only .50 even after correction for
unreliability. These data contradicted the notion that honesty is a unified
trait which can be measured in any tempting situation. Furthermore,
correlations between different character tests were so low as to prove untenable
the view that a generalized “good character” accounted for desirable traits.
Intercorrelations of honesty, cooperation, and so on were only about .25.
The “general factor” in character has small influence on any specific
behavior (Hartshorne and May, 1930).
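
"Correction for unreliability" here presumably means the classical correction for attenuation, which estimates the correlation two measures would show if both were perfectly reliable. A minimal sketch follows; the observed correlation and the two reliabilities in the example are hypothetical values chosen only to make the arithmetic easy to follow, not figures from the Hartshorne-May report.

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between error-free scores on x and y.

    r_xy: observed correlation between the two tests
    r_xx, r_yy: reliability coefficients of the two tests
    """
    return r_xy / sqrt(r_xx * r_yy)

# Hypothetical case: observed r of .40 between two tests, each with
# reliability .80, gives a corrected correlation of .50.
corrected = correct_for_attenuation(0.40, 0.80, 0.80)
```

A corrected coefficient well below 1.00 means that the two tests share only part of what each measures dependably, which is the point of the argument above.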

Similar results are found for persistence (Thornton, 1939; MacArthur,
1955). MacArthur obtained 21 measures on English schoolboys, all the tests
being presumed to have something to do with persistence. A very large
proportion of the intercorrelations were insignificant, though some correlations
were in the .50-.60 range. The general factor accounted for only 14 percent
of the performance on the collection of tests. Among the scores which
seemed to be good measures of “general persistence” were time spent in
completing a magic number square, time spent on a difficult
three-dimensional wooden puzzle (Japanese Cross), and ratings of persistence by
teachers and peers. Combining eight scores could provide a measure of
general persistence with reliability .79. MacArthur also found four group factors.
One factor linked tests where pupils had a chance to see if their classmates
were still working (in contrast to those where each had to set his own
standards). This factor MacArthur named “social suggestibility in situations
demanding persistence.” Reputation measures formed a group factor. Two
more factors were required to account for persistence on intellectual tasks
and persistence in physical tasks. It is evident that persistence is to a large
degree situational.
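
The gain from combining several modestly correlated scores can be illustrated with the generalized Spearman-Brown formula. The sketch assumes an equally weighted composite of standardized scores; how MacArthur actually formed or evaluated his composite is not stated here, so the implied average intercorrelation computed at the end is an illustration under that assumption, not his result.

```python
def composite_reliability(k, mean_r):
    """Reliability of an equally weighted sum of k standardized scores
    whose average intercorrelation is mean_r (generalized Spearman-Brown)."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)

def mean_r_needed(k, reliability):
    """Average intercorrelation that k scores must have for their
    composite to reach the stated reliability (inverse of the above)."""
    return reliability / (k - (k - 1) * reliability)

# Under the stated assumption, eight scores reach a composite
# reliability of .79 with an average intercorrelation of about .32.
implied = mean_r_needed(8, 0.79)
```

The formula makes the general point concrete: scores that agree only moderately one with another can still yield a dependable composite when enough of them are combined.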

It should not be concluded that tests of specific traits have no place. They
are invaluable for many research purposes, and the findings of such research
may have practical significance. Maller (J. McV. Hunt, 1944) tells us that
the Hartshorne-May findings on character led one national agency working
with youth to revise its program completely, because the study showed
that those who had received most recognition in the agency’s
character-building activities were on the average most likely to cheat. This is not
hard to explain when we consider that striving for recognition in
competition, and working for high scores even in a puzzle test, may stem from
the same basic feeling of inadequacy.

There are three ways to interpret structured performance tests.

• They can be regarded as specific measures of one type of performance,
defined only by the operations used in measuring. When they are used as
dependent variables in psychological experiments, any positive findings are
likely to be of ultimate theoretical importance even though the test cannot
at present be interpreted in terms of general attributes.

• They can be used to measure general traits of personality. But because
of the low correlations among tests of the same supposed trait, any such
measurement requires a composite of diverse tasks. Performance testing
of any personality construct seems to require a “hodgepodge” such as Binet
invented to measure intelligence when no single type of problem proved
adequate.

• They can be used singly or in combination to predict practically
important variables. Any significant findings (see below) would be important
whether or not the tests could be interpreted in terms of constructs.

11. Analyze the story-completion test of persistence, identifying all the factors
which might cause one fifth-grader to earn a higher score than another.

12. According to Studies in Deceit (Hartshorne and May, 1928), children from
homes with low socioeconomic status cheat more on achievement tests than
other children (r = .49). What factors must be taken into account before
concluding that these children are more likely to violate standards of good
conduct?

Practical Correlates. Comparisons with socially important criteria have
been few and unsystematic. As comprehensive as any evaluation program
has been the work of the Air Force, whose conclusions are summarized by
Melton (1947, pp. 848-849):

    A continuing effort was made during the Aviation Psychology
    Program to obtain a test of the reaction of the candidate to
    emotion-producing stimuli, either directly through the application of such stimuli
    as distractions during the course of performance on some psychomotor
    task or indirectly through the measurement of muscular tension or other
    psychophysiological variables. . . . The available data do not support
    the hypothesis that additional validity for the prediction of success in
    elementary pilot training accrues to a test situation when verbal threats
    and other distractions, including presumably fear-producing stimuli,
    are administered.

Though the Operational Stress Test had validities of .20-.30, it overlapped
so much with ability tests that it made no useful contribution to prediction.
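
Why a test with a validity of .20 to .30 can add nothing to prediction is seen from the formula for the multiple correlation of a criterion with two predictors. In the sketch below only the stress-test validity of .25 falls in the range reported above; the validity of .50 for the ability battery and the overlap of .50 between the two predictors are hypothetical values chosen for illustration.

```python
from math import sqrt

def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2: validity of each predictor taken alone
    r12: correlation between the two predictors
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    return sqrt(r_squared)

# Hypothetical ability battery (validity .50) plus a stress test
# (validity .25) that correlates .50 with the battery.
combined = multiple_r(0.50, 0.25, 0.50)
gain = combined - 0.50   # improvement over the ability battery alone
```

Under these assumed values the gain is zero: everything the stress test predicts is already predicted by the ability battery. Had the two predictors been uncorrelated, the same stress test would have raised the multiple correlation to about .56.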

Structured tests have been widely applied in clinical research, and many
studies demonstrate differences between patients and normals or among
patients of different types. Such results are difficult to interpret; diagnostic
categories have uncertain psychological significance, and results can often
be attributed to differences in cooperation and attention rather than to more
fundamental psychological processes. Burdock, Sutton, and Zubin (1958)
summarize much tentative evidence for various positive relations; for
example, discharge of schizophrenics from hospital is predicted by low
flicker-fusion thresholds and good performance on the Color Word test.
Confirmation of these results would have obvious practical importance. It might also
lay a basis for theories about qualities which predispose to recovery.
According to Burdock and his colleagues, previous applications of performance tests
to patients have picked variables too unsystematically, have not
distinguished conceptual from perceptual performances, and have failed to
compare complex performances against “baseline” measures of physiological and
neurological functioning.

Performance measures of personality are related to social-psychological
variables. Linton (1955), for example, measured attitudes before and after
reading a biased but allegedly authoritative article. An index based on the
Rod and Frame and Embedded Figures tests together correlated .66 with
change scores. Subjects who could not disentangle relevant from irrelevant
stimuli were most easily persuaded.

Significant predictions of resistance to propaganda or of recovery from
schizophrenia are illustrative of the many studies which encourage hope
for practical application of performance tests. None of the relationships,
however, is well confirmed. Few correlations are checked by repeat studies,
and not infrequently a repeat study fails to confirm the initial finding. One
can find no correlation between a performance test and a practical criterion
that is at present well enough established to warrant basing individual or
administrative decisions on the test.


OBSERVATION OF COMPLEX PERFORMANCE 

We turn from the highly structured measures of single traits to tests which
assess style of performance in relatively complex tasks.

Problem Solving 

During an ability test such as Block Design one can observe method of
attack and response to frustration. Better information is obtained by
modifying the test and by specifying precisely what is to be observed. Goldner
(1957) defined “method of attack” in terms of two more definite variables,
“whole-part approach” and rigidity. He used six tests: a modified Block
Design test; the Arthur Stencil Design test, in which cutouts of various
colors must be superimposed to form a specified pattern; Anagrams I,
in which the subject builds numerous words from a set of letters;
Anagrams II, requiring identification of a scrambled ten-letter word; the
Rorschach ink blots; and a “Function test.” In the last named, the subject is
asked “What are the possible different uses of ____ (box, broom, pliers,
paper)?” Goldner developed scoring rules for each test. For the Function test,
the whole-part score was assigned according to whether the answer used the
whole object (“Put things in the box”) or broke it into parts (“Use it for
firewood”). In the Block Design test, a “whole” attack is shown by the person
who turns each block to the correct face before beginning assembly, and
then assembles the pattern as a unit, paying attention to symmetry, etc. The
“part” approach is shown by the person who starts at one corner and adds
one block at a time, building up the pattern piecemeal. To bring out
differences in approach, Goldner made three changes from the original Kohs test.
He used irregular, nonsquare designs so as to make analysis of the pattern
more difficult. He presented every design, whether it used nine or sixteen
blocks, in the same size, so that the subject had to decide how many blocks
to use.

Goldner also judged rigidity. In Block Design, for example, rigidity was
scored if the subject had difficulty in judging the correct number of blocks,
retained the same attack after a failure, or gave up without finishing a
problem. In the Function test, rigidity was identified with a tendency to give
many logically similar uses (“as a tool box,” “for mailing,” “to pack things”),
whereas flexibility was identified with variety.

This technique contrasts with the Water Jar test used as a measure of
rigidity. In that simpler task the score reports a particular countable
symptom. In Goldner’s battery each score is a rating of an assumed mental
process which can appear in the performance in several different ways. The
more complex task “spreads out” performance so that mental process can be
inferred.

Goldner found substantial support for the hypothesis of generality of the
two traits observed. The results above the diagonal in Table 69 show that
five of his six “whole-part” scores are correlated. Each test agrees
substantially with the total of the other measures. Particularly striking is the
correlation between tasks as dissimilar as Anagrams I and Stencil Design. The
Function test is an exception; whole approach on this test either is
unreliable or is a different trait from whole approach on the other tasks. Goldner’s
findings on rigidity are quite similar: five tasks have marked correlations
with each other. This time, however, Anagrams I is independent.

TABLE 69. Generality Among Tests of Problem-Solving Style

                                             Anagrams  Anagrams  Stencil   Block     Total “Whole-Part” Score
Test                     Inkblot   Function  I         II        Design    Design    Minus Particular Test
-----------------------  --------  --------  --------  --------  --------  --------  ------------------------
Inkblot                   --        .25       .40       .40       .58       .40       .67
Function                  .42       --        .00      -.02      -.02       .10       .08
Anagrams I                .51       .19       --        .25       .53       .36       .48
Anagrams II               .10       .25      -.30       --        .29      -.03       .27
Stencil Design            .17       .32      -.19       .50       --        .62       .66
Block Design              .26       .16       .00       .34       .83       --        .48
Total Rigidity score      .54       .43       .06       .30       .58       .54
  minus particular test

Note: Correlations above the diagonal are for whole-part scores, those below
the diagonal for rigidity scores. Correlations in boldface are significant.
Source: Goldner, 1957, p. 14.

Concept-formation tests have been specially designed for study of
abnormalities of thought processes. The Hanfmann-Kasanin test may be taken as
an example. Twenty-two wooden blocks of several colors, shapes, heights,
and sizes are placed on the table. On the hidden underside of each block is
printed a nonsense syllable. The syllable defines a type of block (all mur
blocks are small and tall). The examiner tells the subject that the blocks are
of four different kinds, one of which is named mur, and that he is to discover
the basis for classification and sort the blocks. The subject proceeds as he
likes to discover the classification, except that he may not invert the blocks to
look at the name. After each trial sorting, the examiner points out one mistake
and asks for another trial. This goes on until the problem is solved and the
principle of classification is stated. Observation shows whether the subject
uses a logical hypothesis (“perhaps all mur are triangles”), an arbitrary
hypothesis, or pure guessing. One observes ability to profit from a correction,
ability to discard a false set and form a new concept, bizarre procedures and
verbalizations, and so on. More significant than the numerical score is the
insight and conceptual thinking displayed. The theory behind the test is that
schizophrenics are unable to think abstractly and must respond to each
object in the environment as a separate thing.

Though clinical groups differ on conceptual tests, the tests are by no means
infallible diagnostic indicators. Some brain-injured patients show normal
concept formation, and tests of either schizophrenics or of disturbed patients
with low ability may easily be misinterpreted as indicative of brain damage
(Zangwill, in Buros, 1949, p. 79). The designers of such tests argue for
impressionistic analysis as the only dependable method of interpretation, but
reviewers conclude that the advantages of the tests could be retained in a
strictly objective scoring of processes observed (A. J. Yates, 1954).

13. Which of Goldner's tests are most structured? Do they correlate more highly 
with each other than with unstructured tests? 


Perception 

The Bender Test. Perceptual tests examine how the subject takes in,
rearranges, and reports information. One such test is the Bender Visual Motor
Gestalt Test (or Bender-Gestalt). The Bender, like many other tests we
have mentioned, grows out of Gestaltist research on perception (Bender,
1938). Figures with different patterns of organization, including those in
Figure 96, are shown. The tester asks the subject to copy the set, and
observes his mode of attack and his success.

Each structured personality test was pointed specifically toward
measuring some single trait. Goldner, using more complex tests, was able to score
each one for two qualities, approach and rigidity. The Bender cannot be
characterized as a measure of any one trait or set of traits. Responses of
individuals may differ in a hundred different ways, which the tester attempts
to observe, collate, and interpret. Scoring rules have been developed (Pascal
and Suttell, 1951), but the clinician generally attempts a qualitative
integration (see Chapter 19).

Performance may be treated statistically by observing “signs” which
characterize some criterion group. Gobetz (1953) listed behaviors which
distinguished neurotics from normals, for example:

    Upward slope in reproducing rows of dots
    Incorrect number of wave crests
    Counting aloud during reproduction
    Figures crowded into half of page

When Gobetz counted the number of such signs he found a significant
difference: 19 percent of a cross-validation group of neurotics showed nine or
more signs, compared to 4 percent of normals. Many of the signs reported
by other authors as characteristic of neurotics did not differentiate in his
study. Gobetz concluded that the test could be helpful in locating
emotionally disturbed persons, provided that other data were used to confirm any
diagnosis of maladjustment.
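
What these two percentages imply for screening depends on how common neurosis is among the persons tested, a figure the passage does not supply. The sketch below applies Bayes' rule to the reported rates; the base rate of 10 percent is a hypothetical value, and the function name is our own.

```python
def proportion_truly_neurotic(hit_rate, false_alarm_rate, base_rate):
    """Among persons flagged by the cutting score, the proportion who
    belong to the criterion group (Bayes' rule).

    hit_rate: proportion of neurotics showing nine or more signs
    false_alarm_rate: proportion of normals showing nine or more signs
    base_rate: proportion of neurotics in the population screened
    """
    flagged_neurotic = hit_rate * base_rate
    flagged_normal = false_alarm_rate * (1.0 - base_rate)
    return flagged_neurotic / (flagged_neurotic + flagged_normal)

# Reported rates (.19 and .04) with a hypothetical base rate of .10.
among_flagged = proportion_truly_neurotic(0.19, 0.04, 0.10)
missed = 1.0 - 0.19   # proportion of neurotics the cutting score misses
```

Under that assumed base rate only about a third of the persons flagged would be neurotic, and four-fifths of the neurotics would pass unnoticed, which is consistent with the requirement that other data confirm any diagnosis.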

14. Judging from Gobetz' signs, what traits does the Bender measure? 

15. Compare the screening effectiveness of Gobetz' scoring system with that for
the MMPI (p. 479).

The Rorschach Test. The Bender is a test of mental efficiency. The subject
is set a simple, objective, and literal task; any emotional perturbation of
mental processes can impair the performance. In the Rorschach test, “reality”
plays a much smaller role in guiding responses. The subject is asked to tell
what he sees in ten inkblots, blots whose form is so irregular as to permit
innumerable interpretations. The blots are calculated to arouse emotional
response with their bloody reds, ominous blacks, and luminous grays, and
with their forms suggestive of nursery animals, overbearing giants, and sex
organs. While the task as stated is purely intellectual, it also reveals
emotional patterns. Patients often become extremely agitated while responding
to the cards, and emotion is sufficient to damage some responses of
seemingly normal individuals.

The technique was invented by Hermann Rorschach, a Swiss
psychiatrist. He used the blots for “an experimental study of form perception” and
found that patients of different types had different ways of responding to
the blots. His diagnostic method, published in 1921, has been elaborated by
subsequent investigators, with a shift of emphasis from attempted
psychiatric classification to a description of psychodynamics. In the United States
S. J. Beck (1945, 1952) has made minor changes in Rorschach’s scoring
system and developed crude norms; a more radical revision of the scoring
system has been offered by Bruno Klopfer (Klopfer et al., 1954).

The Rorschach is used very extensively in clinical testing, even though its
dependability is seriously questioned by nonclinical psychologists and by
some expert clinicians. The Rorschach became prominent because
psychiatrists have great interest in the descriptions of personality it yields.
In the 1940's, when psychologists were first used in large numbers in mental
hospitals and treatment centers, the chief duty assigned them by the medical
staff was the administration of intelligence tests and Rorschachs. As a result,
training in Rorschach interpretation became a requirement for clinical
psychologists.

The interpretation begins with a systematic scoring according to fairly
objective rules. There are about a dozen major scores, which fall into three
major categories: location, determinants, and content. Location scores indicate
whether the response uses the whole blot (W), commonly perceived
subdivisions (D), or unusual details (Dd). The “determinants” are the shape,
color, and shading of the blot which the subject takes into account.
“Movement (M),” for example, is scored when the subject describes humans in
motion, and CF when the response depends on both form and color, with color
the more significant in determining the response. The scorer also notes how
well the response fits the form of the blot, scoring form quality + or -.
Finally, the content score notes whether the response refers to persons (H),
parts of persons (Hd), clothing (Cg), etc.

The scoring of four responses will illustrate the procedure. Card X is a
mixture of brightly colored forms. Suppose that these four responses are given:

1. A big splashy print design for a summer dress.

2. Enlarged photograph of a snowflake [refers to a large irregularly shaped
area].

3. Two little boys blowing bubbles. You just see them from the waist up.

4. Head of a rabbit.

The scoring of these responses (based on supplementary explanation
obtained by inquiry) is:


Response   Location   Determinant and Form Level   Content

   1          W                  CF+               Art, Cg
   2          D                  F-                Nature
   3          D                  M+                Hd
   4          D                  F+                Ad

Norms can be collected for Rorschach scores, but to be meaningful the norms
must be based on subgroups of the population rather than people in
general. The procedure for testing can also be standardized. Interpretation,
however, has never been reduced to a systematic procedure.

The quality of responses indicates something both about the subject’s
intellectual level and about the effort and carefulness he puts into an
intellectual task. Much is made of the subject’s control over his impulses and
his emotional reactions. In Rorschach interpretation, movement responses are
thought to represent imagination and creative impulses arising from within,
and color responses are thought to represent emotional reactions to external
stimuli. “Form” is equated with ability to take reality into account. A person
who harmonizes form and movement is said to accept and use constructively
his inner impulses; a person who rarely reports a movement response is
regarded as lacking in imagination or as repressing it. Hertz (1942) expresses
a view shared by most specialists in Rorschach method:

In the final analysis the procedure of the interpretation in terms of
other clinical and test data defies standardization, as Rorschach
originally contended. The information gleaned from the Rorschach material
is projected against family background, education, training, health
history, past life, qualitative judgments of the examiner and of other
people, and other clinical and test data. This is then interpreted in terms of
the examiner’s experimental knowledge of the dynamics of human
behavior. Final conclusions are made by inference and analogy depending
upon the experience, ingenuity, the fertility of insight, and, not to be
forgotten, the common sense of the examiner. Prolonged and extensive
experience is necessary, not only with human personality but with all
kinds of clinical problems. This last step by definition, therefore, is
personal to the examiner and subjective in view. It permits no norms, and
it eludes all standardization.

The impressionistic interpretation is constructed through the use of a
great number of interrelated hypotheses about the internal forces and
controls which lead to each type of response. Each of these hypotheses must
ultimately be verified to make interpretation trustworthy. Successful
experience with individual cases is the chief basis on which users defend the
Rorschach method. There has been considerable formal research on the
hypotheses, but the complexity of the problems posed by the tests has made
comprehensive research exceedingly difficult (Cronbach, 1949; Mary D.
Ainsworth, in Klopfer et al., 1954, pp. 405-500). Sometimes the evidence is
strikingly favorable.

Rorschach (1921; see 1942, p. 7) said that movement responses are
indicative of personalities that “function more in the intellectual sphere,
whose interests gravitate more towards their intrapsychic living rather than
towards the world outside themselves.” This introversion interpretation was
checked by Barron (1955), who employed a psychophysical technique to obtain
an M score, like Rorschach’s but much more reliable. This score was compared
with ratings made by clinical assessors using other data. The persons with
strong M tendencies were described as inventive, having wide interests,
introspective, concerned with self as object, valuing cognitive pursuits. The
low M subjects were described as practical, stubborn, preferring action to
contemplation, inflexible in thought and action. Although this supports
Rorschach’s interpretation, Klopfer’s use of M as a prime indicator of
intelligence is questioned; there was no correlation between Barron’s M score
and objective tests of intelligence and originality. To be sure, the
psychologist assessors rated the high M’s as more intelligent, but in view of
the objective test results this implies that assessors are biased toward
judging persons intelligent if they appear “thoughtful.” In another study,
similar mixed confirmation of Rorschach theory is found. S responses, which
interpret the white space between the blots, are presumed to indicate
oppositional tendencies, and Bandura (1954) found a correlation of .35
between the S score and ratings of negativism. No support was found, however,
for the configural hypothesis that the meaning of this oppositional tendency
depends on the M:C balance, S in high M subjects implying self-criticism, and
in high C subjects implying opposition to others. (See also D. C. Murray,
1957.) 1

Hundreds of additional studies could be cited, each dealing with one bit
of Rorschach theory. The trend of the results (Benton, 1950; Holtzman
et al., 1954; Sarason, 1954) is this:

• About half of the experimental tests of Rorschach hypotheses give
results consistent with clinical theory. The interpretation certainly has
“validity greater than chance.”

• These confirmations indicate rather small degrees of relationship
between Rorschach indicators and postulated traits. (Bandura's correlation of
.35 is typical.) Many different personality factors and abilities influence
any one score, and no direct trait interpretation can be made with
confidence.

• Some aspects of the theory are definitely incorrect and should be
revised.

Just how adequate the test, with these limitations, is for global
impressionistic assessment will be considered in Chapter 19.

There have been attempts to use single quantitative scores from the
Rorschach either as trait measures or as empirical predictors. Sets of “signs”
have been developed, for example, to identify persons with organic brain
damage (A. J. Yates, 1954; Fisher, Gonda, and Little, 1954). Generally, these
proposed special formulas prove valueless on cross-validation, either showing
no validity or having too high a false positive rate to be useful. Trait scores
such as Goldner’s measures of approach and rigidity sometimes correlate
with external criteria, but the correlations are generally too small to warrant
use of the Rorschach as a quantitative measure. Some investigators feel that
these deficiencies could be removed by redesigning the test in order to
obtain a better sample of behavior. Holtzman (1958) has prepared two
parallel sets of 45 blots and developed a scoring system for them. The subject
gives one response to each blot. Score reliability is expected to surpass that
of the conventional ten-blot test, but validity remains to be determined.
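
The expectation of higher reliability from a longer series of blots follows the general rule that lengthening a test raises reliability. A minimal sketch using the Spearman-Brown formula shows the size of gain one might expect. It assumes the added blots behave like parallel items, which a one-response-per-blot format only approximates, and the starting reliability of .40 is a made-up figure for illustration, not a value reported for either test.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

# ASSUMPTION: a score from ten blots has reliability .40 (hypothetical value).
ten_blot_reliability = 0.40
length_factor = 45 / 10  # going from 10 blots to 45 blots

projected = spearman_brown(ten_blot_reliability, length_factor)
print(round(projected, 2))
```

The formula speaks only to reliability; whether the longer test is also more valid is a separate question, as the text notes.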

1 An amusing and possibly profound analysis of the disagreements among validation
studies is contained in a study by Levy and Orr (1959) on “the social psychology of
Rorschach validation.” When university psychologists set up experiments to test construct
interpretations of the Rorschach, their predictions are confirmed 70 percent of the time.
Comparable studies by psychologists working in clinical settings succeed only 50 percent
of the time. When working clinicians try to establish criterion validity, however, they
succeed in 60 percent of their studies, whereas academic psychologists are successful in
only one-third of their attempts to validate the test against clinical criteria.

16. Why might a “perfect” accuracy score, 100 percent F+, indicate a personality
pattern undesirable for many situations?

17. Do the Rorschach scores related to quality of output reveal maximum ability or 
typical behavior? How does the Rorschach compare with the Binet test in that 
respect? 

18. “Responses to ten inkblots, presented by one tester on one occasion, constitute
too small a sample of behavior to measure any intellectual or emotional trait
reliably.” Do you agree? To what extent does Holtzman’s test overcome this
objection?

19. It was pointed out (p. 131) that lengthening a test improves validity most if the
score is a pure measure of the quality measured by the criterion. In view of
this principle, would increasing the sample of inkblot responses in Holtzman's
manner be expected to have much or little effect on validity?

Group Behavior 

Group Discussion. The Leaderless Group Discussion (LGD) is a systematic
observation procedure used to study social behavior (Bion, 1946). A
group of persons, perhaps applying for the same job, are told to discuss a
certain problem (e.g., how to increase movie attendance). Observers rate
predetermined aspects of each member’s performance. The LGD is
unstructured: no rules of procedure are established, the topic is left largely
undefined, and the group, being strangers to each other, have no initial
friendship or dominance relations. During the discussion, however, social
patterns are quickly built up, and the role the person plays is presumably
similar to the role he is prone to adopt in natural groups.

The variables most commonly rated have to do with three traits:
prominence, goal facilitation (efficiency, suggesting useful ideas), and
sociability. Bass (1954) measures prominence by rating the following behaviors
(on a scale from “a great deal” to “not at all”):

showed initiative
was effective in saying what he wanted to say
clearly defined or outlined the problems
motivated others to participate
influenced the other participants
offered good solutions to the problem
led the discussion

What the test chiefly measures, Bass says, is “tendency to initiate structure
in an initially unstructured situation.”
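
One way such ratings might be combined into a single prominence score is sketched below. The seven behaviors are those listed above; the 0-to-4 numerical coding of the scale and the use of a simple sum are assumptions made for this example, since the text does not say how many scale points Bass used or how the ratings were combined.

```python
# The seven rated behaviors listed in the text.
BEHAVIORS = [
    "showed initiative",
    "was effective in saying what he wanted to say",
    "clearly defined or outlined the problems",
    "motivated others to participate",
    "influenced the other participants",
    "offered good solutions to the problem",
    "led the discussion",
]

# ASSUMPTION: 0 stands for "not at all" and 4 for "a great deal".
MAX_RATING = 4

def prominence_score(ratings):
    """Sum one observer's ratings over all seven behaviors (hypothetical rule)."""
    total = 0
    for behavior in BEHAVIORS:
        if behavior not in ratings:
            raise ValueError(f"missing rating for: {behavior}")
        rating = ratings[behavior]
        if not 0 <= rating <= MAX_RATING:
            raise ValueError(f"rating out of range for: {behavior}")
        total += rating
    return total

# Example: a member rated 3 on everything except leading the discussion.
example = {behavior: 3 for behavior in BEHAVIORS}
example["led the discussion"] = 1
score = prominence_score(example)
print(score, "out of", MAX_RATING * len(BEHAVIORS))
```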

The effectiveness of the LGD can be evaluated in several ways. Stability
over trials is fairly high; with a week between tests, the correlations range
from .75 to .90. Over longer time intervals or with radical changes in the type
of problem, correlations drop to about .50. The test is measuring some
consistent and general aspect of personality. Behavior in practical situations
is no doubt determined by many forces other than personality (seniority,
relative prestige, specifically relevant knowledge, etc.) but LGD scores
nonetheless have striking predictive value. Bass and Coates (1952)
compared LGD scores with ratings by superiors given as much as nine months
later and found correlations of .40 to .45. Arbous (1955) reports a validity of
.60 for LGD against rated promise of executives in training. Suitability for
the British foreign service as rated after two years on duty was predicted
(validity .33) by LGD scores at the time of selection (Vernon, 1950).

The LGD procedure illustrates the advantage that can be obtained from
systematic observations. Social relations are important in personnel
assignment, yet it is very difficult to judge validity from questionnaires,
letters of recommendation, or interviews. The LGD is an economical “worksample”
of group behavior. By scoring observed behavior it avoids much of the bias
inherent in summary impressions. Army colonels’ ratings of cadet potential
were much poorer predictors of later merit ratings than were total scores
recorded by these same colonels acting as observers for an LGD session
(Bass, 1954).

20. Give reasons for each of the following recommendations by Bass regarding
LGD technique:

a. Counts of actual behavior (e.g., new approaches suggested) should be
substituted for ratings of the subject’s tendency to suggest new approaches.

b. Problems should be equally ambiguous to all participants.

c. Examinees tested in a group should all have the same rank.

21. Compare LGD and peer ratings as methods of assessing leadership potential.

Task Leadership. The Leaderless Group Discussion is one of a number of
“worksample” techniques for measuring personality which originated in
German and British military psychology. Psychologists selecting officers
thought it necessary to observe complex behavior combining intellect, emotion,
and habit. One simple team-performance task devised by the Germans uses two
pairs of shears linked by rods so that they must move in unison. While one
shear is opening, the other is closing. Each subject operates one pair of
shears, cutting a series of increasingly complex patterns from a sheet of paper.
The shears are so arranged that if one man goes directly and forcefully at his
task, the shears of the other man move in a rhythm which makes accurate
cutting almost impossible. By means of observation, automatic recording,
and inspection of the product, the tester looks for evidence of initiative,
dominance, and cooperation (Kunze, 1931). In a group leadership test used
by OSS, the American wartime intelligence service, candidates were
directed to move a heavy eight-foot log, and themselves, over two walls ten
feet high, eight feet apart, and separated by an imaginary bottomless chasm.
Observers noted which men took initiative and leadership, how they
directed others, how they accepted orders, and so on.

Perhaps the high point of fiendish ingenuity was the OSS construction
test. The subject is assigned to build a five-foot cube with a set of
super-Tinkertoys. Poles and spools must be fitted together, and since the parts
are too large to be managed by one man, two helpers are assigned. After giving
directions, the tester ostentatiously clicks his stop watch and retreats. What
the subject does not know is that his helpers are highly trained
stumblebums. Kippy is negative, indolent, a drawback. Buster is an eager
beaver, ready to do all manner of things, mostly wrong, and also primed to
needle the candidate with personal criticism. This is reported as a typical
dialog (Anon., 1946):

Candidate. Well, let’s get going.

Buster. What is it you want done, exactly? What do I do first?

Candidate. Well, first put some corners together—let’s see, make eight of these
corners and be sure you pin them like this one.

Buster. You mean we both make eight corners or just one of us?

Candidate. You each make four of these, and hurry.

Kippy. Whacha in, the Navy? You look like one of them curly-headed Navy
boys all the girls are after.

Candidate. Er, no, I’m not in anything.

Kippy. Just a draft dodger, eh?

Candidate. Let’s have less talk and more work. You build a square over here
and you build one over there.

Kippy. Who are you talking to—him or me? Why don’t you give us a number
or something—call one of us number one and the other number two?

Candidate. I’m sorry. What’s your name?

Buster. Mine’s Buster and his is Kippy. What's yours?

Candidate. You can call me Slim.

Buster. Not with that shining head of yours. What do they call you, Baldy or
Curly? Did you ever think of wearing a toupee?

Slim. Come on, get to work.

Kippy. He’s sensitive about being bald.

Slim. Just let’s get this thing finished. We haven't much more time. Hey, there,
you, be careful. You knocked that pole out deliberately.

Kippy. Who me? Now listen to me, you ----, if this ---- thing had been
built right from the beginning, the poles wouldn’t come out. For ----, they send
a boy out here to do a man’s job.

Kippy and Buster are psychologists and are in a position to make an
excellent report on the man’s reaction. (The fact that they had served as Army
privates, and that some of the candidates they were privileged to torment
were generals being considered for special assignment, probably also set an
untouchable record for job satisfaction among psychologists.)

22. A field performance test of an NCO’s ability to lead his squad was developed
by the Army as a criterion measure of proficiency. Why are some performance
tests regarded as measures of personality and some as measures of proficiency?

THEMATIC PROJECTIVE TECHNIQUES

The Bender-Gestalt and the Rorschach illustrate one type of projective
technique, in which the subject’s style of handling a problem is the focus of
attention. These primarily stylistic tests may be contrasted with thematic
tests, in which the interpreter is especially concerned with the content of
the subject’s thoughts and fantasies. This distinction resembles that between
trait questionnaires such as Guilford’s, which focus on response patterns,
and techniques for studying stimulus meanings such as the Semantic
Differential and the Rep test. The stylistic and thematic categories are not
mutually exclusive. One can identify specific fears or obsessions in the
Rorschach protocol, and can even interpret “content” of Bender reproductions
through Freudian symbolism. Conversely, mental style is observed in the
Thematic Apperception Test. But the stylistic tests generally yield richer
stylistic information than the thematic tests and are rather poor sources of
thematic information. The thematic test comes nearer to examining “the whole
person” at once than any other testing technique, seeking information on
emotions, attitudes, and cognitive processes, so that it does give a
comprehensive, if tentative, portrait of the whole personality.

The Thematic Apperception Test 

The Thematic Apperception Test of H. A. Murray and his coworkers
(1938) requires the subject to interpret a picture by telling a story—what is
happening, what led up to the scene, and what will be the outcome. The
responses are dictated by the constructs, experiences, conflicts, and wishes of
the subject. Essentially the person projects himself into the scene,
identifying with a character just as he vicariously takes the place of the
actor when he sees a movie. The TAT consists of twenty pictures, different
pictures being used for men and women. Since two one-hour sessions are
required for the full test, investigators often use shortened versions. The
subject is led to believe that his imagination is being tested. The interpreter
gives particular attention to the themes behind the plots. The stories may
indicate a defeatist attitude, concern about overbearing authority figures, or
preoccupation with sex. In addition to these aspects of response content, the
interpreter considers the style: use of the whole picture rather than piecemeal
attack, fluency, concern with accuracy in fitting the story to the picture, etc.

FIG. 97. Cartoon stimulus for a thematic test. The picture is presented with
the statement “Here is Blacky with Mama’s collar.” (From the Blacky Pictures
by G. S. Blum. Copyright 1950, The Psychological Corporation. Reproduced by
permission.)

The interpreter looks at each story in turn, deriving hypotheses from the
plot, the symbolism, and the style. The hypothesis from one story (e.g., “This
man represses all hostile feelings”) is checked against subsequent stories.
The interpreter must decide how much weight to give to each of many
conflicting indications and must integrate the information on intellectual
powers, emotional conflicts, and defense mechanisms indicated by the test
protocol.

Only a few illustrations of the analysis can be given here. Card I of the
TAT shows a boy, perhaps 10 years old, looking at a violin lying on a flat
surface. A girl, age 14, with a Binet IQ of 143, gives this story (Henry, 1956,
p. 111):

Right now the boy is looking at the violin. It looks like he might be kind of sad
or mad because he has to play. Before he might have played ball with the other
boys and his mother wouldn’t let him. He had to go in and play. Looks like he
might practice for a little while and then sneak out.

Henry, working from this and other stories, estimated her IQ at 140,
commenting on how clearly the story “takes into account the basic stimulus
demands of the picture” and goes on to “entirely relevant elaborations of good
quality [which] attribute motive and action to the characters.”
Whereas this story led more to a study of process than of plot, the story of a
42-year-old clerk is interpreted thematically (Henry, p. 145):

The story behind this is that this is the son of a very well-known, a very good
musician and the father has probably died. The only thing the son has left is this
violin which is undoubtedly a very good one and to the son, the violin is the father
and the son sits there daydreaming of the time that he will understand the music
and interpret it on the violin that his father had played.

Henry comments that the first sentence shows preoccupation with excellence
and a conviction that to match the example is impossible. The man dreams
only of things within himself, and takes no action to carry out his ambition.

Contrasting with this rather direct interpretation of a plot as reflecting
the teller's drives and style of behavior, another story shows the possibility
of identifying deeper symbolism behind the fantasy. A recent immigrant, a
man age 29, tells this story (Henry, p. 178):

A young boy sitting in front of a violin spread out on white table, or white linen.
It is not clear in the expression of the face if he thinks in glorification and
admiration of that what the violin and music could hold for him or if he is bored
and in disgust with the lesson he has to take and doesn't want.

Note, says Henry, the emphasis on conflicting alternatives: glorification or
disgust, has to take and doesn’t want. This personality “may well be marked
by its attraction to opposites.” The core of conflict appears to be sexual, the
basic issue being whether woman can be “both the Madonna and the sexual
object. . . . This is an instance of the use of the violin as a sexual symbol.
The man is basically preoccupied with some strong emotional issue, hence
he utilizes form details in a distorting manner [e.g., “violin spread out”].
. . . He feels impelled to make a formal heterosexual adjustment as well as a
conventional social adjustment, even though both are somewhat forced and
against his will.”

These excerpts by no means represent the intricacy of a full
interpretation in which stories are compared with each other and with
background information about the subject. For examples of such full
interpretations the reader is referred to Henry (1956) and Shneidman (1951).
We should also emphasize that such interpretations as Henry makes are—if the
psychologist is properly trained—extremely tentative, and are discarded unless
there is supporting evidence elsewhere in the test and the subject’s history.
These illustrations do indicate the individuality of style which TAT responses
exhibit, and the variation in the interpreter’s attack. At one moment he views
the performance entirely as an intellectual effort; at another he treats the
response as a symbolization of unconscious conflict. How he interprets each
response depends upon the story and perhaps upon his own artistic impulses
of the moment.

Though interpretation has been primarily qualitative and impressionistic,
it is possible to develop objective scoring systems for the TAT. There are
dozens of common variables whose strength can be observed in almost every
TAT performance: perception of authority, reaction to extremely difficult
tasks, originality, reliance on luck and magical intervention. The themes
themselves are often highly individualistic, but common elements can be
tabulated. Shneidman (1951) presents fifteen TAT scoring systems used by
various clinicians. Such scoring reports the percentage of stories whose
outcome is unhappy, the number of female characters seen as predatory or
demanding, etc. These scores play a larger part in research than in clinical
analysis of individuals. Use of TAT scores for diagnosis appears worthy of
further exploration. Dana (1955, 1956) developed four scores for expressive
aspects of the performance which separated neurotics, psychotics, and
normals. Mussen and Naylor (1954) validated an aggression score for TAT
stories, showing that it correlated with frequency of overt aggressive
behavior in problem boys. More than that, when the frequency of mention of
punishment in the TAT was used as a measure of fear of punishment, it was
shown that behavior depended on both aggressive drive and fear. Every one
of the seven boys with high TAT aggression and low fear of punishment
showed high behavioral aggression; only two out of nine with high TAT
aggression and high fear of punishment were overtly aggressive. This is an
example of the oft-reiterated principle that no one personality score is fully
interpretable by itself.
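
The contrast between the two groups of boys can be checked with simple arithmetic. The sketch below is an illustration added here, not an analysis taken from the original study: it tabulates the counts given in the text and computes a one-sided Fisher exact probability. The function name is hypothetical, and only the boys with high TAT aggression are included, since the text gives no counts for the others.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(top-left cell >= a) for a 2x2 table with all margins held fixed."""
    row1 = a + b            # size of the first group
    col1 = a + c            # total number showing the behavior
    n = a + b + c + d
    upper = min(row1, col1)
    favorable = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return favorable / comb(n, row1)

# Rows: low fear of punishment, high fear of punishment.
# Columns: overtly aggressive, not overtly aggressive.
low_fear = (7, 0)
high_fear = (2, 7)

rate_low_fear = low_fear[0] / sum(low_fear)
rate_high_fear = high_fear[0] / sum(high_fear)
p_value = fisher_one_sided(low_fear[0], low_fear[1], high_fear[0], high_fear[1])

print(rate_low_fear, round(rate_high_fear, 2), round(p_value, 4))
```

With these counts the probability comes to about .003, so a split this extreme would be unlikely if fear of punishment made no difference, although sixteen boys is a small sample.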

Stability coefficients over two months are in the range .60-.90 for such
scores as need for abasement, giving stories with positive outcomes, and
use of tension-relief words. Though the evidence is scanty, this is a very
favorable indication of the possibility of accurate measurement, since the
strength of needs measured by TAT changes somewhat from occasion to
occasion and consistency cannot be perfect (Crandall, 1951; Lindzey and
Herman, 1955). From these and other studies, it appears that the TAT
collects sufficient information to permit fairly accurate scoring of traits, if
scoring keys are carefully developed toward this end.

23. How many traits are mentioned in Henry’s three interpretations? 

24. Can one regard the frequency of punishment by authority in TAT stories as a
sample of behavior indicating how often the subject is punished in life?

Measurement of Need for Achievement 

The TAT is designed to cover the whole range of ideas and behavior and
therefore cannot cover any one topic thoroughly. While a person obsessed
with independence conflicts may bring them into every story, most people
reveal their relationships with authority only on one or two cards specifically
designed to elicit such stories. As the examples above show, any single
picture is indefinite enough to bring out different types of information from
different subjects. This flexibility, which permits the subject to reveal almost
any trait or theme that is prominent in his personality structure, is an
advantage in a free-ranging exploration of personality. But it is a serious
disadvantage when one wishes to answer a specific question.

Focused tests are designed to elicit thematic responses all of which bear
on the same question. For example, Murphy and Likert (1938) carried out
research on labor-management conflict by presenting pictures of strikers in
conflict with police, etc. Shapiro, Biber, and Minuchin (1957) tested
teachers’ attitudes by presenting cartoon pictures of classroom scenes and
pupil groups. A focused test for Air Force personnel was based on the
hypothesis that outwardly directed aggression would be associated with
tolerance for high centrifugal forces. The criterion was a measure of the force
(number of “G”) required to produce blackout in a human centrifuge. In the
best-designed of several validation studies, the score from the thematic test
classified 18 of 25 subjects correctly as having high or low tolerance (A. J.
Silverman et al., 1957).
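
To judge how impressive 18 correct classifications out of 25 is, one can ask how often guessing would do as well. The sketch below assumes that a guess would be right half the time, as it would be if the high and low groups were about equal in size; the text does not report the group sizes, so this chance rate is an assumption, and the function name is hypothetical.

```python
from math import comb

def prob_at_least(successes, trials, chance=0.5):
    """Binomial probability of `successes` or more correct out of `trials`."""
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

hit_rate = 18 / 25
p_by_guessing = prob_at_least(18, 25)  # ASSUMPTION: chance = 0.5
print(hit_rate, round(p_by_guessing, 3))
```

On that assumption, guessing would reach 18 or more hits only about twice in a hundred tries, so the 72 percent hit rate is better than chance, though far from perfect classification.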

The possibilities of the focused thematic test have been most thoroughly
exploited by McClelland and his associates (1953). He selected four
pictures (two from the TAT) intended to bring out attitudes toward
achievement. The four cards are a work situation (men at a machine), a study
situation (boy at desk with book), a father-son picture, and a boy apparently
daydreaming. Achievement motivation was scored in every story suggesting
competition with a standard. For example,

A worker is pulling a hot plate of metal back in the oven with a pair of tongs in
order to heat it up again. The gentleman beside him is a helper.

is scored as showing need for achievement (n Ach) because the reheating
implies “desire to move ahead to the ultimate goal” (Atkinson, 1958, p. 722).
Detailed scoring manuals have been developed (McClelland et al., 1953;
Atkinson, 1958, pp. 685-735).

A second projective measure for the same purpose is the French Test of
Insight (Atkinson, 1958, pp. 242-248). A brief description of behavior is
given, e.g., “Bill always lets the ‘other fellow’ win”; the subject is to provide
an explanation. The test consists of twenty such items, ten in each form. The
score is the number of times desire for achievement is mentioned as a
motive.

There is evidence that such projective measures are getting at a
different aspect of personality than that shown in other measures. De Charms
et al. (McClelland, 1955, pp. 414ff.) made up a questionnaire on desire for
achievement (called v Ach). This correlated only .23 with n Ach. The
subject with high v Ach is concerned with conformity, is deferential to expert
authority, and disapproves unsuccessful people. High n Ach, on the other
hand, is more associated with striving and effectiveness. Scores on the
French test correlated near zero with peer judgments of motivation to
achieve (Atkinson, 1958, p. 247). French was able to show that the peer
judgments of motivation depended heavily on observed success and thus are
probably reflections of ability rather than motivation.

Even though it is unrelated to observations and self-report, the projective
test is related to behavior. High n Ach is generally associated with striving
and effectiveness (de Charms et al., in McClelland, 1955, p. 421). French
and Thomas (1958) divided a selected group of highly intelligent subjects
into those with high and low n Ach on the Insight test and required them
to solve a difficult intellectual problem of the type described in Chapter 1
(p. 7). The problem had several acceptable solutions. The high n Ach
group worked, on the average, twice as long as the others before giving up,
and were much more successful in arriving at at least one solution. In the
highly motivated group, performance correlated .36 with ability, but the
correlation was zero in the low n Ach group. Ability predicts only when men
are motivated to use that ability.

It is hypothesized by McClelland that thematic tests reflect strength of
motives at a given moment, as well as the average need level of the
individual. This is supported by the finding that n Ach scores are higher when
tension is raised by ego-involving directions (French, 1955), as well as by
studies with focused tests of sexual, affiliative, and hunger drives.

Prominent Projective Techniques 

During the decade 1945-1955 there was a wave of indiscriminate enthusiasm
for developing new projective techniques. Dozens of approaches were
tried, but in most cases the research was so superficial that the unique merit
of each procedure, if any, was not established. Only a few of the techniques
have survived, and some of them retain popularity only because a cult of
specialists keeps the test prominent. The following projective tests, in addition
to those previously described, are encountered with greatest frequency
in the current research literature or in clinical practice.

• The Blacky Pictures; Gerald S. Blum; Psychological Corporation, 1950.
A set of cartoons involving a small dog; situations are designed to elicit
stories revealing sexual attitudes (Figure 97). The pictures suggest various
types of conflict derived from psychoanalytic theory (e.g., castration
anxiety). Validation is seriously inadequate.

• Children’s Apperception Test (CAT); Leopold Bellak, 1948, 1951. (See
Bellak, 1954.) A TAT-type instrument for ages 3-10. Pictures of animals such
as might be used in nursery stories are aimed to elicit information on feeding
conflicts, sibling rivalry, and other childhood problems. Validity is presumably
very similar to that of TAT.

• Four-Picture Test; D. J. van Lennep; M. Nijhoff, The Hague, 1930,
1948, 1958. The subject is given four pictures showing two solitary figures
and two social scenes; one story is to be woven around all four pictures. In
the hands of an experienced user, the test should have values similar to those
of the TAT. (See H. H. Anderson and Gladys L. Anderson, 1951, pp. 149-180;
van Lennep, 1958.)

• House-Tree-Person (HTP) Test; J. N. Buck; Western Psychological
Services, 1946, 1950. This is one of several tests in which the subject merely
executes a drawing. Different interpreters emphasize different aspects of
the production. Although claims are made for successful interpretation in
clinical cases, careful validation studies cast doubt upon the specific
interpretative principles offered (Anastasi and Foley, 1952; Fisher and Fisher,
1950). There is no doubt that drawings reflect personality, but there is great
uncertainty as to how to make sound inferences from them.

• Make-a-Picture-Story Test (MAPS); Edwin S. Shneidman; Psychological
Corporation, 1947. A variant of TAT in which the subject assembles paper
cutout figures against backgrounds to make his own pictures. Presumably
similar to TAT in value though even less structured (Shneidman,
1951).

• Rosenzweig Picture Frustration (PF) Test; Saul Rosenzweig, 1944,
1948. A set of cartoons in which one figure thwarts another; the subject,
taking the part of the second person, is to tell how he would reply. The PF is
therefore a self-report test using focused fantasies. The test is objectively
scored. Though it is of definite research value, there is no clear theory for
interpreting individual scores.

• Sentence Completion Technique; J. B. Rotter; Psychological Corporation,
1950. Another version: Amanda R. Rohde and Gertrude Hildreth;
Psychological Corporation, 1947. The sentence-completion method is one of
the simplest methods of obtaining information on conflicts, either for
screening of disturbed persons or as a preliminary to interview. Unfinished
sentences such as “My mother . . .” or “When I make a mistake . . .” are to be
completed by the subject. Several versions of the test have been crudely
standardized. Although responses can be consciously controlled, the
cooperative subject generally gives a useful picture of some of his salient
attitudes (Rotter et al., 1949).

• The Szondi Test; L. Szondi, 1937, 1951. (See Deri, 1949.) Photographs
of patients having various diagnoses are presented to the subject, who
indicates which ones he prefers. It is assumed that even though the patient does
not know the diagnoses, his unconscious tendencies to approach one type
or another reflect his personal needs. Available evidence indicates that the
Szondi-Deri hypotheses are invalid (Lubin in Buros, 1953, pp. 255-256).

Some summary may be attempted regarding the value of performance
tests and projective tests when used as psychometric instruments, even
though their variety requires that generalization be cautious. We have seen
that tests vary greatly in their degree of focus. Some, such as the
flicker-fusion measure and the box-of-coins test, sample behavior of an exceedingly
specific type. Composite scores such as MacArthur’s general persistence
score or Goldner’s whole-part score derived from several techniques cover
a broader range of behavior. Likewise, the focused measures of n Ach and
the Leaderless Group Discussion procedure give reliable scores which have
appreciable correlations with nontest variables. When focus is almost
completely removed, as in the TAT and Rorschach, it becomes much harder to
measure any one variable accurately. In the present state of performance
testing, therefore, these conclusions seem justified:

• Highly structured tests of narrowly defined variables have little usefulness
today save in development of psychological theory. Such tests are
unlikely to have ultimate practical value except in a composite or battery, or in
some rare situation comparable to the use of a color-vision test as a measure
of occupational aptitude, where a specific test factor duplicates a specific
task requirement.

• Less narrowly focused tests have potential value as measures of general
personality attributes. The one procedure now known to have practical value
is the LGD, which is a worksample. With composite performance scores like
MacArthur’s or focused tests like McClelland’s, significant traits can be
measured. These trait measures are of potential value as measures of
dependent and independent variables in research. Because numerous traits
interact to determine behavior in any situation, any one such measure will
rarely have a large correlation with any practical criterion.

• Unfocused projective techniques are poorly adapted for quantitative
measurement, although scores can be derived from them. Their chief function
is in impressionistic assessment of individuals, as part of a thorough case
study employing numerous other sources of information. The next chapter
will discuss this.


Suggested Readings 

Bass, Bernard M. The leaderless group discussion. Psychol. Bull., 1954, 51,
465-492.

This is a comprehensive account of evidence on the practical validity of LGD
scores, together with an analysis of the personality and ability factors which
lead to good LGD performance.

Biber, Barbara, & others. Problem-solving situations. Life and ways of the
seven-year-old. New York: Basic Books, 1952. Pp. 298-344.

The authors applied four tests of problem-solving behavior to ten 7-year-olds,
primarily to observe their approach, insight, and reaction to blocking. The
chapter describes the tests and method of scoring and summarizes results for
the group. It also describes problem tests used by other investigators.
Elsewhere in the book, case descriptions of the separate children are given,
so that the information from the performance observations can be compared
with other data.

Burdock, Eugene I., Sutton, Samuel, & Zubin, Joseph. Personality and
psychopathology. J. abnorm. soc. Psychol., 1958, 56, 18-30.

Research is described involving nineteen structured tests of traits ranging
from the physiological to the conceptual. Preliminary results are given on
the possible diagnostic and theoretical significance of each test, together
with critical comments on performance measures previously used in clinical
work.

Levitt, Eugene E. The water-jar Einstellung test as a measure of rigidity.
Psychol. Bull., 1956, 53, 347-370.

Levitt summarizes numerous investigations of the generality of this test. He
shows that the same test takes on different psychological properties with only
slight changes in design. Levitt’s critique of the psychometric properties
of the Water Jar test raises questions applicable to nearly all performance
tests.

McClelland, David C. Performance traits. Personality. New York: William Sloane,
1951. Pp. 162-199.

A penetrating discussion of the use of performance tests and ratings to isolate
useful dimensions for describing typical behavior includes an excellent survey
of factor-analytic studies of personality.




Assessment of Personality Dynamics 


THE psychometric tradition isolates separate dimensions of ability and
personality and represents the individual by assigning scores on those
dimensions. Other testers follow a more artistic tradition in which tests are seen as
just one procedure for gaining insight into a complexly organized system of
needs, concepts, and perceptual attitudes. The impressionistic interpreter is
not primarily concerned to arrive at quantitative scores on abstracted
dimensions. He is concerned with the organization of processes within the
individual which give unity to his behavior. The impressionistic interpreter
asks what “personality structure” (intrapersonal organization) could
account for the observed facts: for the ways he perceives significant others in
his life, for the discrepancies between his abilities on various tests, for the
seeming differences between his fantasy needs and his overt behavior, etc.
Such a coherent picture is of great potential value in making decisions about
a case, though its usefulness depends entirely upon its quality and
completeness, and these in turn depend on the range of information available
and the astuteness of the clinician’s synthesis of it. The clinician’s judgment
makes full use of his psychological theory and his experience with other
cases, but his final portrait of the case is an artistic reconciliation of diverse
impressions.

Almost any psychological test, observation, or interview can be used as a
basis for such a portrait. We have illustrated this with many examples: Jones’
description of John Sanders from the Stanford-Binet (p. 187); Kirk’s
description of a type of underachiever (p. 456); Grayson’s description of a
patient from the MMPI (p. 491); Osgood’s and Luria’s description of Eve from
the Semantic Differential (p. 503); Henry’s three partial TAT interpretations
(p. 570). These examples vary considerably in the extent to which the
interpreter speculates beyond the observed facts. Henry’s three sketches
range from an almost literal description of the young girl’s reasoning methods
to a symbolic translation of the story given by the immigrant man. As one
further example which shows how freely a clinician employs creative
imagination, psychoanalytic theory, and even frank speculation, we may quote a
Bender interpretation given by Max Hutt (Shneidman, 1951, pp. 227-233)
for the same patient whose MMPI Grayson interpreted.

Hutt first notes that the reproduced figures are arranged in the same
sequence as in the original, the first six being aligned with the left margin, but
that the last two drawings are fitted into the right half of the page.

Our first hunches then are: this individual has strong orderly, i.e., compulsive
needs, tending towards a sort of compulsive ritual, but tries to deny them [the first
drawing being displaced away from the margin and the examiner having noted
that the man draws fast and unhesitatingly] . . . and he is oppressed with
some (probably) generalized feelings of anxiety and (more specifically) personal
inadequacy (clings to the left margin and is “constrained” to use all of the space
available to him on this one sheet). We raise the question for consideration, at
once, “How strong and from what source is this anxiety and what is his defense?”
We can speculate, from his use of space, that he attempts in some way to “bind”
his anxiety, i.e., he cannot tolerate it for long or in large amounts, and that one of
the features of this young adult’s functioning is the need of control. The
super-ego is very strict.

Here Hutt has looked at the style of the man’s performance, and then has
tried to infer what inner tensions and defenses could generate such a style.
As he says, these are hunches and speculations to be checked against other
evidence in the protocol and against all other information about the patient.
As in most “dynamic” interpretations, Hutt uses Freudian concepts of drives,
conflict, and defenses.

As an example of more detailed analysis, consider Hutt’s remarks on the
subject’s reproduction of the rows of circles (see Figure 98).

the ten diagonal columns of circles [offer] further evidence of the marked
variability which begins to appear to be characteristic of this “S.” The examiner
notes, “Checks number of rows (i.e., columns) about two-thirds through.” We note
that the angles of the columns of dots differ, becoming more obtuse (from the
vertical) with a correction towards the end. The whole figure is exaggerated in the
lateral plane. Together, these findings suggest a strong need to relate to people, but
difficulty in establishing such relationships. The orientation of the first column is
correct, so the variation in “angulation” is not a simple perceptual difficulty. “S”
gets the number of columns correct, but varies both angulation and spacing. We
have evidence, then, for the presence of considerable internal tension with an
attempt at denial of its existence. How can we explain the apparent contradiction of
the need for order and control with the speed and variability of performance?

Without giving further details of Hutt’s reasoning about perceptual style,
we quote a few of his conclusions for comparison with the MMPI report and
the opinions of the therapist (p. 491). All the following descriptions are
used, in a context which explains how they were derived and with what
degree of confidence: compulsive defenses not effective . . . acting out
regressive impulsivity breaks through . . . possibility of psychotic episodes
. . . depressive reaction.

The clinician’s interpretation of scraps and shreds of evidence is daring.
Can one really know a person from the minor irregularities of his rows of
dots? But if Hutt’s interpretations of style seem bold, there are more startling
things to come. From statements about defenses and controls, Hutt turns to
the symbolism of the figures, following certain Freudian ideas about
“masculine” and “feminine” designs:

[In the figure composed of an open square with a wave form at one corner]
“S” has increased the vertical sides of the open square. . . . S’s reaction to
authority figures can now be inferred more completely: he is hostile to such figures,
unable to express this hostility directly, and reacts either symbolically or
impulsively. In line with the “acting out” hypothesis, the former is more likely. The
curved portion of this figure is enlarged, flattened out in the middle and reveals an
impulsive flourish at the upper end. Now we may speculate that S’s major
identification is with a female figure, but she is perceived as more masculine (i.e.,
dominant, aggressive) than feminine and is reacted to openly with antagonism. It is
interesting that the upper portion of the curved figure extends well above its
position on the stimulus card, and is at least as high up as the vertical lines. Here we
may conjecture that S’s mother (or surrogate) was stronger psychologically than
his father, or at least seemed so to S, and that S would like to use his mother (or
women) to defy his father (or men).

Such test interpretation is often severely criticized as unscientific. In
defense of the method, we may note that Hutt is able to give a detailed
rationale for each of his inferences; he is by no means allowing his fantasy free
rein. A much stronger defense is that his description of the patient agrees
well with the clinical picture given both from the MMPI self-report and in
the therapist’s notes. Word for word, we find confirmation there of
ineffectual defense through obsession, hostile impulses the subject fears to express,
and so on. Other remarks of the therapist support some of Hutt’s most
hazardous-seeming guesses: “The father seems to be a hazy person in the
patient’s life.” “He talked of wishing to strangle his mother.” “Had difficulty
with authority figures.” While the Bender analysis was not a perfect
description, it yielded much better information about the depths of personality
than one might expect from seemingly wild interpretations of a little task
suitable for a child’s drawing exercise.

The fact that clinicians continually have such striking successes with
individuals gives them considerable right to feel confident in their methods
and theory. At the same time, the clinical tests rarely satisfy the demand for
systematic validation. If it was difficult to nail down the validity of the Pd
score on MMPI, it is impossible to put into statistical form the evidence for
such innumerable hypotheses as that enlargement of the Bender wave-form
indicates a particular attitude toward one’s mother.

The dynamic interpretation of the person by means of a single complex
test or a whole assortment of techniques is frequently called “assessment,”
to distinguish it from psychometric measurement. Assessment most
commonly takes one of two forms. The first is clinical analysis, well illustrated in
the Grayson and Hutt interpretations. The second is prediction of
performance of normal or superior persons assigned to responsible jobs.

Personality assessment of normals grew out of the German military
testing of the 1930’s, whose team-performance tests were mentioned earlier. In
the hands of German testers, tests were regarded primarily as samples of
character traits such as will power and rigidity. The techniques were
adapted in Great Britain for War Office Selection Boards. When wartime
conditions made it necessary to select officers from the ranks instead of
relying on professionals from the upper classes, these boards took responsibility
for judging ability and character of applicants for commissions. In the
United States, Professor Henry Murray and his associates described the
application of a large number of assessment techniques to Harvard students in
the ground-breaking Explorations in Personality (1938). During World War
II, Murray was asked to select staff for the Office of Strategic Services, the
forerunner of today’s Central Intelligence Agency (OSS Assessment Staff,
1948). In that program, use was made of group discussions, team-performance
tests, stress interviews, observation at meals and social events, peer
ratings, projective techniques, and structured tests of many types. The range
of the testing program, its claim to penetrate hidden depths, and its
cloak-and-dagger mystery have provided vivid material for unsympathetic
writers; see A. P. Herbert’s Number Nine (1952), an attack on British civil
servant selection, and Morgan’s The OSS and I (1957). In recent years,
assessment methods have been used for selecting officers and executives,
and in numerous research programs.

The principal features of assessment procedures are use of a variety of
techniques, primary reliance on observations in unstructured situations, and
integration of information by experienced psychologists. No assessment
program refuses to employ intelligence test scores or other relevant facts, but
the emphasis remains upon synthesizing these data quasi-artistically rather
than upon combining separate scores in a statistical formula.

Our chief concern in this chapter is to evaluate impressionistic assessment.
There have been many validity studies, some penetrating and some
superficial. We shall review the best of these.


VALIDATION STUDIES 

Attempts to Predict Job Performance 

The original assessment programs in the military and intelligence services
were never adequately validated, largely because candidates were scattered
to far places and to diverse duties, so that criterion data were lacking.
Perhaps the most meaningful figure is the report from the British Army that
ratings of 500 officers in combat by their noncommissioned subordinates
correlated .35 with Selection Board ratings. (This validity is corrected to apply
to the entire range of candidates processed rather than the restricted group
recommended for commissions; Vernon and Parry, 1949, p. 125.) These
British studies also found that assessment may be seriously unreliable. One
group of candidates was assessed separately by two boards, and the
correlation between ratings was only .67. Reliabilities of .80 were achieved,
however, by teams trained to use similar standards and procedures.
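
The kind of correction mentioned in the parenthesis above can be illustrated
with the standard formula for direct restriction of range on the predictor. The
source does not say which formula Vernon and Parry applied, and it reports
neither the uncorrected correlation nor the ratio of standard deviations, so the
input values in this Python sketch are hypothetical and serve only to show the
direction and rough size of such an adjustment.

```python
import math


def correct_for_range_restriction(r_restricted, sd_ratio):
    """Estimate the correlation in the full applicant group.

    r_restricted: correlation observed in the selected (restricted) group.
    sd_ratio: predictor standard deviation among all candidates divided by
              the predictor standard deviation in the selected group.
    Uses the usual formula for direct selection on the predictor.
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    numerator = r_restricted * sd_ratio
    denominator = math.sqrt(
        1 - r_restricted ** 2 + (r_restricted ** 2) * (sd_ratio ** 2)
    )
    return numerator / denominator


# Hypothetical inputs, not taken from the British Army report.
observed_r = 0.25   # assumed correlation among commissioned officers only
ratio = 1.5         # assumed ratio of unrestricted to restricted SD
print(round(correct_for_range_restriction(observed_r, ratio), 2))
```

With these made-up inputs a correlation of .25 in the selected group rises to
about .36. The less variable the selected group is relative to all candidates,
the larger the upward adjustment; when the two standard deviations are equal
the coefficient is left unchanged.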

Assessment of British Civil Servants. Three-day “house-party” assessments of
candidates for the British civil service were validated by Vernon (1950a),
who collected follow-up data on the men accepted. Though measurement of
individual differences among such superior university graduates is extremely
difficult, the validity coefficients were encouraging. Grades in a training
course were predicted by final assessment rating with a validity of .82. For
job-performance criteria the validities were .50-.65.

Two hundred administrators were rated by superiors after a two-year
probationary period. Vernon gives fifty correlations of predictors with this
criterion, and the predictors fall into two distinct groups. There are 27
validities for written ability tests; every coefficient is below .30, and the median
is about .12. There are 19 correlations for ratings made by observers after
performance tests or interviews; these correlations range from .26 to .49, the
median being .41. Peer ratings had validities near .25 for this group.
Evidently the impressionistic procedure identified aptitudes the
pencil-and-paper tests did not.
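
The gap between the two groups of predictors is easier to judge when each
coefficient is squared to give the proportion of criterion variance associated
with it. The Python sketch below does this for the median validities quoted
above. Squaring is a common rule of thumb and not something Vernon is reported
to have done, the figure of .25 for peer ratings is only approximate, and the
function name is purely illustrative.

```python
def variance_accounted_for(r):
    """Proportion of criterion variance associated with a validity coefficient."""
    return r * r


# Median (or approximate) validities quoted in the text.
median_validities = {
    "written ability tests": 0.12,
    "observer ratings": 0.41,
    "peer ratings": 0.25,
}

for predictor, r in median_validities.items():
    share = variance_accounted_for(r)
    print(f"{predictor}: r = {r:.2f}, r squared = {share:.3f}")
```

On these figures the observer ratings are associated with roughly 17 per cent
of the criterion variance and the written tests with between 1 and 2 per cent,
which is more than a tenfold difference.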

It is important to note that in this successful assessment study selectors
had “a clear and agreed conception of what they were selecting for, based
on a thorough job analysis.” The performance tests were for the most part
job replicas of civil service paper work, committee tasks, and group
discussions. Little use was made of personality theory. Projective tests, field
observations, and stress interviews were absent or were given minimal attention
by the raters.

The VA Study of Clinical Psychologists. When the Veterans Administration
began to support training of clinical psychologists in 1947, it sponsored a
program of selection research directed by E. Lowell Kelly. Kelly’s team,
which included prominent clinical psychologists and experienced OSS
assessors, applied “every promising technique and procedure, objective,
projective, subjective, clinical and quantitative” to a group of 137 graduate
students in a nine-day assessment program. Supplementary groups were also
tested. Criterion data collected from universities in 1950 included
information on the trainee’s ability as a therapist, as a diagnostician, and as a student
of research methods.

During assessment, ratings had been made of general surface habits (e.g.,
readiness to cooperate), underlying personality (e.g., characteristic
intensity of inner emotional tension), and potential performance as a psychologist
in various roles (e.g., group psychotherapy). Ratings were made by some
persons who knew only the situational data, by some who knew only the
interview, and by some who had access to various combinations of data; this
procedure permitted comparison of techniques. Only a partial account of
the hundreds of correlations can be given here.

Table 70 indicates how well single test scores or ratings based on them
predicted certain important criterion ratings. The correlations, though
frequently better than chance, are much too small for the predictors taken
singly to be of substantial value in selection or guidance. The general ability
test was much better than other methods for predicting academic
performance, and also best for predicting rated clinical competence. The next best
measure is peer ratings; these had some validity even for the criterion ratings
of diagnosis and therapy. The Bender-Gestalt did remarkably well on two
of the criteria. Validities are lowered by the unreliability of criteria and the
fact that the trainees had already been screened to eliminate obviously
unsuitable candidates.


TABLE 70. Selected Validity Coefficients for Single Predictors of Competence in
Clinical Psychology

                                          Criterion Ratings
                                                                          Clinical
                                          Academic  Therapy   Diagnostic  Competence

Miller Analogies (verbal ability)         .47       .02       .24         .35
Guilford self-report questionnaires
  S (social extroversion)                 .06       .22       .23         .11
  T (thinking extroversion)               .05       .09       .17         .19
  Highest of 13 r's                      [illeg.]  (.22)     (.23)       (.19)
MMPI
  Highest of 9 r's, regular scales       (.26)     (-.16)    (-.12)      (-.16)
  Gough Psychologist key                  .16       .15       .22         .25
Strong VIB
  Group I (creative-scientific)           .26       .06       .04         .21
  Kriedt Clinical Psychologist key        .10       .23       .23         .26
Ratings from Bender-Gestalt               .15       .02       .33         .32
Ratings from TAT                          .08       .16       .24         .15
Ratings from performance tests (pooled)   .19       .19       .02         .24
Self rating                               .25      -.20       .05         .00
Peer ratings                              .13       .28       .23         .25

Boldface correlations are statistically significant.
Source: Kelly and Fiske, 1951, pp. 146 ff.

Table 71 shows the overall validity for impressionistic assessments
combining all sources of data. The correlations of .46 for academic prediction
and .38 for clinical competence compare very favorably with the best that
could be expected from a statistical combination of scores or ratings. The
coefficients for diagnosis and therapy are lower, as in Table 70. This is
[illegible] to be explained simply by the inadequacy of those criteria. A probably
even more pertinent explanation is that the assessors, some of whom had had
[illegible] experience in analyzing personalities of graduate students and none of
whom had studied the personality required in such roles as psychotherapist or
[illegible], were unable to make competent ratings on these two [illegible].


TABLE 71. Validity of Ratings of Clinical Psychologists Based on
Comprehensive Assessment

                                    Criterion Rating
                                                                    Clinical
Assessment Rating                   Academic  Therapy   Diagnosis   Competence

Academic                            .46       .09       .21         .45
Therapy                             .24       .24       .29         .36
Diagnosis                           .36       .14       .16         .32
Overall suitability for
  clinical psychology               .27       .18       .22         .38

Source: Kelly and Fiske, 1951, p. 161.

The third table indicates how much each part of the assessment program
added to the final judgment. (Since this final judgment was made by
three assessors rather than the whole staff, the figures cannot be matched
with those in Table 71.) Apparently, assessors did just as well when they
had only the credentials file and objective test scores as they did with the
addition of interviews and performance tests. This information, considered
along with the modest validity coefficients for the performance tests taken
alone (Table 70), does not encourage faith in the performance observations,
at least in the absence of psychological job analysis.


TABLE 72. Evidence on Change of Validity with Added Information

Credentials file plus objective tests (one rater)            .36
Above plus autobiography, projectives (one rater)            .38
Above plus interview (one rater)                             .32
All above information (conference of three raters)           .32
Above plus performance tests (one rater)                     .31
Final pooled judgment of three raters                        .33

[Column headings are illegible. A second column of coefficients is only partly
legible and its row alignment is lost: .40, .37, .42, .39, .37.]

Source: Kelly and Fiske, 1951, pp. 168-169.

Menninger School of Psychiatry Study. A similar problem was investigated
by Holt and Luborsky (1958) at the Menninger School of Psychiatry.
Several classes of applicants were interviewed and evaluated by projective
tests. One principal criterion was the competence of the accepted man as
judged by his supervisor during the residency which completed his training.
The original assessment employed the usual practice of predicting success
on the basis of the assessor’s judgment. This judgment is based on
psychological or psychiatric theory, which is presumably able to evaluate how well
a person functions in challenging situations. The results were in general
unsatisfactory. The average validity for the combined information from tests
was .27, and that for interview assessments was .24. Even allowing for the
fact that the correlations are based only on the restricted group who were
accepted and finished training, these validities indicate a high rate of error.

Holt and Luborsky (1958, II, p. 139) examined whether some interviewers
or test interpreters were markedly better than others, and although they
did find differences, they conclude that the evidence “throws cold water on
a frequently encountered suggestion for improving selection methods: ‘Pick
the interviewer who does the best job and have him teach the others how he
does it.’ Even if one entertained the dubious assumption that an interviewer
knows ‘how he does it,’ and is able to teach the helpful rather than the
erroneous parts of his technique, there is still too little difference between the
predictive performance of the best interviewer and those of the others to
make such an endeavor worth while.”

One comment by a test interpreter is particularly significant in pointing to
a central difficulty in assessment: “Reviewing some predictions on which we
erred, we were impressed with our correct assessment of many specific
qualities and our inability to cast these up into proper balance so as to judge
ability to develop skill as a psychiatrist” (Luborsky, 1954).

Predicting Emotional Difficulty in Flying Training. Just as the Menninger data
indicate that highly competent psychiatrists may make poor evaluations,
predictions made by the best of projective testers are sometimes no better
than guesses. Holtzman and Sells (1954) asked nineteen clinicians, including
some of the most prominent authorities on projective methods, to separate
aviation cadets into two groups: those who succeeded in flight training and
those eliminated because they had developed overt personality disturbances.
Each judge classified twenty cases, of whom from eight to twelve
were successes. The judge was given the subject’s responses to group forms
of the inkblot, draw-a-person, and sentence-completion tests, a biographical
inventory, and an inventory of psychosomatic complaints. The mean number
of correct classifications per judge was 10.2, compared to 10 expected by
chance alone. Even where judges were unanimous in their rating, accuracy
was 56 per cent, compared to a chance expectancy of 50. The results were
not improved by treating single tests as a basis for judgment or by considering
only those judgments which the clinician said he felt sure of.
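
The comparison with chance can be made concrete with a little arithmetic.
The sketch below assumes that each of the twenty classifications is an
independent guess with probability .5 of being correct and that the judges
are independent of one another. Those simplifications, and all variable
names, are illustrative assumptions rather than details reported by Holtzman
and Sells.

```python
import math

N_CASES = 20          # cases classified by each judge (from the text)
N_JUDGES = 19         # number of clinicians (from the text)
P_CHANCE = 0.5        # assumed chance of a correct call on any one case
OBSERVED_MEAN = 10.2  # mean correct classifications per judge (from the text)

# Binomial expectation and spread for a single judge guessing at random.
expected_correct = N_CASES * P_CHANCE
sd_single_judge = math.sqrt(N_CASES * P_CHANCE * (1 - P_CHANCE))

# Standard error of the mean over judges (assumes independent judges).
se_mean = sd_single_judge / math.sqrt(N_JUDGES)

# Distance of the observed mean above chance, in standard-error units.
z = (OBSERVED_MEAN - expected_correct) / se_mean

print(f"expected by chance:  {expected_correct:.1f}")
print(f"SD for one judge:    {sd_single_judge:.2f}")
print(f"z for observed mean: {z:.2f}")
```

Under these assumptions the observed mean lies well within one standard
error of the chance value, which is consistent with the statement that the
predictions were no better than guesses.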

Assessment of OCS Candidates. In contrast to the three preceding studies,
where ratings were made by highly qualified psychological assessors, an
Army study showed that recent graduates of Officer Candidate School were
successful in assessing candidates (Holmen et al., 1956). Squads of ten
selected applicants were observed for two weeks in an assessment center.
There were four assessors, each a recent OCS graduate. Leadership exercises
designed by psychologists as performance tests were administered by
these officers. Data collected included ratings by the officers and by the
other squad members, and self-report scores. Pass-fail records in OCS were
the criterion.

For all men combined, average ratings by the assessors had a validity of
.55, and peer ratings of .58. Ratings based on specific performance tests
generally had validities between .25 and .50. Self-report tests had essentially
zero correlations with success. It was recommended that a five-day version
of the procedure be used for selection whenever a reasonably large supply of
qualified applicants is available. Although the assessment is time consuming,
there is evidence that the procedure provides sufficient training and
orientation to increase the man’s chance of success in OCS, and therefore is
economical.

IPAR Study of Air Force Officers. The most extensive research on assessment
methods is the program of the Institute for Personality Assessment and
Research at the University of California. The Institute was organized by
D. W. MacKinnon, one of the original OSS staff, for the purpose of testing
and improving assessment procedures, particularly as applied to superior
men. Studies have been conducted on student, military, and professional
groups, but the only major report is of an assessment of Air Force captains
(MacKinnon et al., 1958; Gough and Krauss, 1958; Barron et al., 1958;
Gough, 1958; MacKinnon, 1958; Woodworth and MacKinnon, 1958).

This is an exceptionally good test of what assessment can and cannot do.
An expert staff was assembled, and an enormous range of procedures was
applied to a large sample of men for whom several appropriate criteria were
later available. The staff had a reasonable understanding of the criterion
task, having previously carried out several studies of military personnel.
Pencil-and-paper tests (ability measures, personality and interest
questionnaires, biographical data) were taken by 34S captains eligible for
promotion. Of these, 100 officers were brought together in groups of 10 for
“living-in” assessment. For three days, they lived with the psychologists,
being interviewed, having a medical examination, taking projective tests,
objective tests of perceptual performance, and group performance tests, and
being evaluated by the staff in informal contacts. In all, there were 238
“field test” (pencil-and-paper) scores and 398 scores or ratings from
“living-in” assessment. These scores were compared with nine major criteria.
In all, including analyses of various subgroups of subjects, more than 15,000
validity coefficients were calculated.

Such a dragnet search for correlates of officer effectiveness is difficult to
interpret. Just by chance, 5 percent of the variables will show “significant”
correlations with any criterion, and it is always possible to invent a plausible
explanation for such relations. The investigators, however, guarded against
serious misinterpretation by dividing the sample and confining interpretation
to results which appeared in several subsamples.
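
The size of the problem, and the protection offered by requiring replication,
can be sketched with simple arithmetic. The example assumes that no true
relations exist, that a 5 percent significance level is used throughout, and
that the subsamples are independent. Three subsamples are used because a
later passage mentions three successive subsamples. Applying the result to
all 15,000 coefficients is only an illustration.

```python
ALPHA = 0.05             # conventional significance level (from the text)
N_COEFFICIENTS = 15_000  # validity coefficients computed (from the text)
N_SUBSAMPLES = 3         # assumed number of independent subsamples

# Expected number of "significant" coefficients when nothing is really there.
expected_false_positives = ALPHA * N_COEFFICIENTS

# Chance that a single null relation is significant in every subsample.
p_all_subsamples = ALPHA ** N_SUBSAMPLES

# Expected number of null relations that would survive the replication rule.
expected_survivors = p_all_subsamples * N_COEFFICIENTS

print(f"expected chance 'hits' overall:     {expected_false_positives:.0f}")
print(f"P(significant in all subsamples):   {p_all_subsamples:.6f}")
print(f"expected survivors of replication:  {expected_survivors:.2f}")
```

On these assumptions several hundred coefficients would reach significance
by chance alone, while only a handful would be expected to do so in every
subsample.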

A second serious difficulty is the dubious validity of the criteria.
Independent criteria of officer effectiveness correlated in the neighborhood
of .30 (Barron et al., 1958, p. 5). Consequently, test validities cannot be
expected to rise much beyond this level. Even the most valid assessment
techniques cannot predict unstable criteria. The probable reasons for low
agreement among various criteria are the restricted range of ability in the
group studied, and the difference in standards of judgment employed by
various superiors.

The staff ratings in which a global assessment was attempted (“overall
military effectiveness”) did not correlate beyond the chance level with any
of the criteria of effectiveness. All correlations were below .20 (MacKinnon,
1958, p. 36). From the psychometric field testing, a composite “good-officer
index” was computed on the basis of a formula developed in previous
research on officers. This composite correlated no higher than .18 with the
Air Force criteria (MacKinnon, p. 28). Three “clusters” summarizing the
assessment ratings had “disappointing” correlations with criteria of
effectiveness (median .13). But although impressionistic ratings were
unsuccessful as predictors of effectiveness, a reanalysis provided some
encouragement. When flying officers were separated from ground officers, the
medians were .21 and -.02 respectively. That is, the assessors did reasonably
well in sizing up flying officers, considering the instability of the criteria,
and had no success at all in evaluating ground officers (Woodworth and
MacKinnon, 1958, pp. 11-13).

Whereas effectiveness was hard to predict, a criterion rating on interpersonal
relations (uncorrelated with the criterion of effectiveness as an officer)
was successfully predicted. Several assessment ratings had validities in the
range .20-.30, which is about as good as the criterion permits. No test score
had appreciable validity for this criterion; the valid information came from
staff appraisals. In general, the person seen by the staff as tolerant,
conforming, and relaxed was rated by his superior as having good relations with
others (Barron et al., 1958, p. 24).

The validity of psychometric measures was evaluated for the total officer
group, but not for flying and ground officers separately. The results varied
slightly from one criterion to the next, but no measure gave consistent
evidence of validity. In fact, out of 194 test variables not a single one
correlated significantly with the officer effectiveness rating in three
successive subsamples (Barron et al., 1958, p. 15). A few scattered
correlations in the neighborhood of .30 indicate that there is promise of
predictive value in empirical keys based on successful performance (e.g., a CPI
key based on responses of high achievers) and in self-ratings on adjustment
(Gough, 1958, p. 6). Rorschach and TAT were not useful in assessing officers
(MacKinnon, 1958).

Summary. The foregoing studies include the major validations of global
predictions to date. The most favorable results were obtained in OCS
assessment, followed by British civil service selection. The VA psychologist
study showed about equal validities for psychometric prediction and for
assessment based on those tests plus a credentials file, with interviews and
performance tests evidently adding little. The initial Menninger study of
psychiatrists (their later study remains to be discussed) was less successful,
and the classification of emotional failures among pilots by projective tests
showed zero validity. Common elements in the more successful procedures
may be noted:

• There is no evidence that psychological training gives the assessor an
advantage. The best results occurred when officer candidates were rated by
recent OCS graduates. The worst, as it happens, were obtained where the
assessors were expert clinicians who, however, lacked specific experience
with the types of candidates and criteria under study. The clinician’s
experience and theoretical background gives him confidence in the judgments
he makes but seems not to make his judgments actually superior to those of
the intelligent untrained observer who knows the job requirements. This
conclusion is supported by the frequent finding that peer ratings have
validities as good as those of ratings by observers.

• Structured tests, or performance tests which are very near to worksamples
of the criterion task, have considerable validity. These tests are directly
interpreted without use of intervening personality theory and can be
used by nonprofessional judges. Tests requiring the judge to infer the
subject’s personality structure and then to predict behavior were rarely
beneficial in these studies. Group performance tasks including LGD make an
important contribution in predicting criteria where acceptance by one’s group
is necessary for success. They contribute much less to prediction when the
criterion task calls for individual performance.

• The most important requirement for valid assessment is that the assessor
have a clear understanding of the psychological requirements of the
criterion task. The civil service and OCS assessors understood the ability
requirements of the criterion task and made little effort at subtle
psychological evaluation. The VA assessors and the Menninger assessors tried to
match the candidates against their mental pictures of the successful
psychologist or psychiatrist. The IPAR assessors assumed that the requirements
for effectiveness were the same for ground officers as for flying officers.
These stereotypes had never been checked by controlled observation of
successful performers. If such an “obvious” relation as that between spatial
aptitude and success in geometry is contrary to the facts, it is not surprising
that stereotypes concerning therapists prove false.

It seems fair to conclude that impressionistic interpretations are often
used where they have no validity. The assessor must learn to distrust even
the most compelling hunch until it has been independently verified. In
Kelly’s apt phrase, too many psychological techniques are used on the basis
of nothing more than “faith validity.”

The generally black picture painted above of psychologists’ most ambitious
efforts at assessment is not, however, the final answer. There is the
difficult problem of reconciling the statistical evidence with the claimed
“clinical validity” of assessment techniques. We need to identify the sources
of error in assessment and to arrive if possible at a statement of the
conditions under which they are or can be made profitable.

1. When a group has been preselected before collection of validity data,
correlations are reduced. In which of the assessment studies cited are the
correlations based on groups more restricted than those which would usually be
assessed?

2. In the OCS study, peer ratings were collected from nine men and assessment
ratings from four judges. How is this fact relevant to the interpretation of
the validity coefficients of .58 and .55, respectively?

3. In which studies did the criterion depend substantially upon ability to make a
good impression upon and win the cooperation of peers?
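
As an aid to thinking about Question 2, the effect of pooling raters can be
sketched with the Spearman-Brown formula, which gives the reliability of an
average of several comparable ratings. The single-rater reliability of .30
used here is a purely hypothetical value chosen for illustration; the rater
reliabilities in the OCS study are not given in this section.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Reliability of the mean of n comparable ratings (Spearman-Brown)."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


HYPOTHETICAL_R = 0.30  # assumed reliability of one rater; not from the study

pooled_four = spearman_brown(HYPOTHETICAL_R, 4)  # four assessors
pooled_nine = spearman_brown(HYPOTHETICAL_R, 9)  # nine peers

print(f"reliability of 4 pooled ratings: {pooled_four:.2f}")
print(f"reliability of 9 pooled ratings: {pooled_nine:.2f}")
```

Whatever single-rater value is assumed, the average of nine ratings is more
reliable than the average of four, so the two validity coefficients do not
compare individual peers with individual assessors on equal terms.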


Sources of Error in Assessment 

In order to understand the difficulties of assessment, we need to recognize
the steps involved in information gathering and inference. Figure 99
compares three types of personnel evaluation: inference based on dynamic
interpretation, direct impressionistic evaluation from behavior samples, and
psychometric prediction.

Let us begin with the right-hand column, which outlines the stages leading
to the criterion. Each box distinguishes one stage. Between boxes, the
small type lists some of the sources of error which preclude a perfect
correspondence among the findings at successive stages. Time intervenes between
assessment and criterion performance; changes in personality during this
period reduce the possibility of perfect assessment. Job performance (2b)
depends not only on personality but on the specific conditions of the
individual’s job. Given a different superior or different assignments, the man’s
performance might change. The criterion 6d reflects performance (2b)
indirectly, being affected by the bias and incomplete observation of the
supervisor. The sources of error in the right-hand column imply that even
with perfect information about personality one could not predict job
criteria perfectly.

[Figure 99. Stages and sources of error in assessment and in criterion.]

The simplest assessment method is psychometric scoring of behavior and
application of a “cookbook” formula to arrive at a prediction (center column
of Figure 99). Reduction of behavior to scores discards some amount of
information. The combining formula may introduce error if it was developed
under conditions that do not apply perfectly to this new sample. Every stage
from 1a to 6c and from 1a to 6d involves an additional opportunity for error.
By this analysis, there are seven places where error can lower the
correlation between prediction 6c and criterion 6d.
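
The cumulative effect of several error-prone stages can be illustrated
numerically. The sketch assumes a simple linear chain in which each stage
relates to the next with a known correlation and the errors entering at
different stages are independent, so that the end-to-end correlation is the
product of the link correlations. The link value of .90 is hypothetical;
Figure 99 assigns no numbers to its stages.

```python
import math


def chain_correlation(link_correlations):
    """End-to-end correlation for a linear chain with independent errors."""
    return math.prod(link_correlations)


HYPOTHETICAL_LINK = 0.90  # assumed correlation across each single stage

# Seven error-prone links, as counted between prediction and criterion.
seven_links = chain_correlation([HYPOTHETICAL_LINK] * 7)

# Nine links, i.e. the same chain with two further inferential stages added.
nine_links = chain_correlation([HYPOTHETICAL_LINK] * 9)

print(f"seven links at .90 each: {seven_links:.2f}")
print(f"nine links at .90 each:  {nine_links:.2f}")
```

Even when every individual link is strong, the product falls below .50 after
seven links, and each additional stage of inference lowers it further.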

Impressionistic rating from observed behavior is illustrated in the OCS
study, where value judgments were made directly, without intervening
dynamic analysis of personality. Here again there are seven possible sources
of error between 6b and 6d. These may not be damaging to the correlation
if, as in the OCS study, the bias of the raters resembles the bias of work
supervisors.

Dynamic assessment, in the left-hand column, involves two added stages
of inference. The step from 3a to 4a is hazardous because personality
constructs are poorly developed and poorly matched to test behavior. The step
from 4a to 5a involves equally undependable constructs about the nature of
the criterion task and about how personality differences affect job behavior
(Cronbach, 1956). The added links of hazardous inference make dynamic
prediction far more prone to error than the more conservative predictions
6b and 6c.

This diagram leads to several suggestions for validation research and for
the improvement of assessment. If criterion 6d is affected by errors, perhaps
a better criterion could be obtained. A worksample of job behavior should
correlate higher than the rating criterion with 5a and 3b. The psychologist
may quite accurately predict a high degree of initiative, for example, and
yet not be able to predict whether the man’s (unknown) supervisor will
evaluate that initiative favorably or unfavorably.

Even more important is the possibility of comparing the inferred personality
structure (4a) with personality on the job (1b). It will be recalled that
Luborsky was able, he said, to judge personalities accurately and yet was
not able to judge whether the men would make good in psychiatry. If this
type of claim is valid, then the weak links of the assessment chain are in
translating knowledge of personality into expected behavior. The generally
low validity coefficients in studies attempting dynamic assessment indicate
that something is wrong along the chain of inferences, but it may be
that 4a corresponds excellently to 1b.

Validity of Clinical Descriptions 

The validity of inferences about personality structure cannot be deduced
from the assessment studies discussed above, where descriptive inferences
were not recorded in a form suitable for verification, and where the criterion
was an overall evaluation rather than a description. This evidence must
come from studies in which inferences are compared with other descriptions
of personality structure.

The literature contains a great many impressive case reports in which
clinical analyses agree strikingly with other data on the individual. The Hutt
and Grayson case descriptions are examples; it will be recalled that these
clinical impressions corresponded very well with the therapist’s case notes.
Nearly every clinician can point to cases where projective techniques gave
him insight into unique features of individuals, features so rare that they
could not possibly be attributed to chance. George DeVos once made a blind
analysis of the Rorschach record of a research worker and, in reporting on
the inferred personality structure, commented, “This man ought to be an
historian. He’d be completely happy down in Washington digging minute
details out of the Lincoln archives” (these being a set of century-old
documents which had just been opened to scholars). The man was a specialist
in a field where historical research is most uncommon, but he actually was
in Washington at the time the analysis was made, extracting detailed
information from fifty-year-old files of a Congressional committee! Such “hits”
cannot be explained away, and constitute the most persuasive evidence of
the value of projective methods.

The critical thinker must ask, however, just how often the descriptions are
correct; perhaps we hear about only the successful predictions. Formal
testing of the validity of descriptions is difficult, and few adequate studies
have yet been reported. One method, asking an informed judge to state whether
a description fits the individual, is unsatisfactory. Judges tend to say that
the description fits even when it was actually written about someone else. This
is due partly to noncriticalness and partly to the tendency to write vague
descriptions that might fit anyone, e.g., “prefers a certain amount of change
and variety” (Sundberg, 1955; Davenport, 1952).

A second procedure uses matching. Descriptions may be prepared for, say,
six persons. Judges who know all the individuals, or who have folders of case
material on them, may be asked to match each clinical description with the
correct person. Judges have had success far beyond a chance level in studies
of this character (e.g., Vernon, 1935; Henry, 1947; Palmer, 1951), but this
alone is not an adequate validation method. A successful match may be
made on one aspect of the description even if other parts of the description
are incorrect, and sometimes there will be a mismatch because of one minor
error in the portrait. A more specific technique which indicates the validity
of particular predictions is required (Cronbach, 1948).
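
What “chance level” means in a matching study can be worked out by
enumeration. The sketch below assumes a judge who assigns six descriptions to
six persons entirely at random, one description per person, and counts the
correct matches over every possible assignment. The variable names are
illustrative only.

```python
from fractions import Fraction
from itertools import permutations

N_PERSONS = 6  # "say, six persons" in the text

# For every possible assignment of descriptions to persons, count how many
# descriptions land on the person they were written about.
counts = {}
for assignment in permutations(range(N_PERSONS)):
    hits = sum(1 for person, description in enumerate(assignment)
               if person == description)
    counts[hits] = counts.get(hits, 0) + 1

total = sum(counts.values())  # number of possible assignments

# Average number of correct matches for a purely random judge.
expected_hits = Fraction(sum(h * c for h, c in counts.items()), total)

# Probabilities of notable outcomes under random matching.
p_all_correct = Fraction(counts[N_PERSONS], total)
p_three_or_more = Fraction(sum(c for h, c in counts.items() if h >= 3), total)

print("expected correct matches:", expected_hits)
print("P(all six correct):      ", p_all_correct)
print("P(three or more correct):", p_three_or_more)
```

A random judge averages one correct match whatever the number of persons,
so a judge who regularly places three or more of six descriptions correctly
is performing well beyond chance, though, as noted above, this alone does not
show which parts of a description are valid.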

One method of considerable value when properly applied is Q sorting.
Statements may be prepared covering dozens of aspects of behavior. These
statements may be judged as fitting or not fitting the individual both by the
assessor and by others well acquainted with his behavior. Various statistical
methods may then be applied to investigate whether the two descriptions
correspond. The most satisfactory procedure is to correlate all descriptions
by the assessor with descriptions of each case by the criterion judges. If the
clinical descriptions are discriminating rather than universally applicable,
the correlation between the two descriptions for the same person will be
much higher than that for descriptions of different persons.
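
The comparison of same-person with different-person correlations can be
sketched as follows. The three persons, the five statements, and every
rating below are invented solely to show the computation; a real Q sort
would use dozens of statements and many more cases.

```python
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical placements of five statements (1 = least like, 5 = most like).
assessor = {"A": [5, 4, 1, 2, 3], "B": [1, 2, 5, 4, 3], "C": [3, 1, 2, 5, 4]}
criterion = {"A": [5, 3, 1, 2, 4], "B": [2, 1, 5, 4, 3], "C": [3, 2, 1, 5, 4]}

# Assessor's description of a person against the criterion description
# of that same person.
same_person = [pearson(assessor[p], criterion[p]) for p in assessor]

# Assessor's description of one person against the criterion description
# of every other person.
different_person = [pearson(assessor[p], criterion[q])
                    for p in assessor for q in criterion if p != q]

mean_same = mean(same_person)
mean_different = mean(different_person)

print(f"mean same-person correlation:      {mean_same:.2f}")
print(f"mean different-person correlation: {mean_different:.2f}")
```

In this invented example the descriptions are discriminating: the
same-person correlations are high while the different-person correlations
average near or below zero. Vague descriptions that fit anyone would make the
two averages nearly equal.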

No ideal study of this type is available. Samuels (1952) examined
trait-by-trait interpretations from the projective tests taken by the VA
psychologists. He found that ratings on such traits as depressed-vs.-cheerful,
social adjustment, and quality of intellectual accomplishments by interpreters
of different tests (Rorschach, TAT, Bender, Sentence Completion) had a median
correlation of only .05-.08. This in itself indicates that there is no
agreement among projective interpreters. A comparable analysis of TAT data
alone was made by Hartmann (1949), with records of 35 delinquent and dependent
boys. Two raters given a complete case record judged each boy on 42
variables. These “criterion” judges had a median correlation of .44, which
indicates how difficult it is to make trait judgments. When the TAT
interpreter, W. E. Henry, rated each boy, his correlation with the criterion
judges averaged .16. Hartmann also obtained a correlation indicating how well
the description for each boy separately, over all 42 scales, agreed with
Henry’s evaluation. While the median correlation between the criterion raters
was .39, Henry’s median correlation with these judges was .25. The TAT
description fell short particularly in judging aggressiveness, stability,
attachment to father, school adjustment, activity in recreation, and moral
standards. It did best on taciturnity, self-reliance, and maturity. In another
study Henry (1947) compared TAT descriptions with other sources of data; there
was essential agreement between TAT and at least one other source in 83
percent of the specific statements. Regarding only 2 percent was there definite
disagreement. These are much better results than in the Hartmann study,
but the comparison is less well controlled.

The foregoing series of findings implies that projective techniques as
currently interpreted are not dependable sources of complex descriptions,
although some reports are appreciably better than random guesses. Their
value can be improved with further development of scoring reliability and
interpretative theory. No evidence is available on the adequacy of
personality descriptions from observations in complex performance tests or
interviews.

IMPROVING THE USEFULNESS OF ASSESSMENT METHODS 

Some critics of projective methods and of impressionistic assessment have 
concluded that these methods are indefensible, and that psychologists who 
depend on them are deluding themselves and those to whom they report. 
Confirmed believers in assessment methods, on the other hand, reject their
critics as “methodological bluenoses” (Bellak, 1954) and even deny the
relevance of such formal evidence as is available. This denial goes much too
far; as R. R. Holt (1958), a proponent of clinical techniques, says, “If the
issue were whether some clinicians have made themselves look foolish by
claiming too much, then I should agree: these studies show that they have, and
unhappily, brought discredit on clinical methods generally.” Since the claims
made in the past have frequently been discredited, any would-be assessor is
responsible for presenting indisputable public evidence of the dependability
of his judgment; vague claims regarding successful experience will not
suffice.

On the other hand, the evidence does not demand abandonment of assessment
methods. Personality testing has had a shorter history than ability testing.
With personality as difficult to analyze as it is and with the available
techniques all open to one serious objection or another, it is important to turn
attention to how assessment techniques can be improved. It is equally
necessary to understand just what function each procedure is best for. Many of
the attacks on projective techniques and many of the defensive arguments
have been based on a misconception of their proper role in the study of
personality.

Improving Test Interpretation 

Projective tests and situational observations have been interpreted by
means of whatever theory a particular interpreter adopts. Some TAT
interpreters view the stories as samples of stable traits likely to be shown in
overt behavior, some consider the test a measure of strength of motives
which shift from occasion to occasion, and some consider the test as a
measure of unconscious and unexpressed drives. If the interpretation of a
test has not yet been stabilized and verified, no one knows what that
instrument might do at its best. Recent years have shown many defects in the
theories used to interpret complex tests, and personality theory itself is
undergoing substantial change.

We cannot examine detailed questions about particular weaknesses in
interpretation, but we can note studies having very general implications. The
first is the finding of serious bias in certain techniques. Soskin (1954, 1959;
see also Samuels, 1952) asked judges to indicate the probable behavior of a
subject, using a multiple-choice test based on incidents from his life. One
group of judges responded knowing only the subject’s age, sex, and social
background. Other judges had Rorschach and TAT records. Answers were
scaled from 4.0 (implying excellent adjustment) to 1.0 (implying severe
maladjustment). The median of the responses representing what the subject
actually had done was 2.5. Judges relying on general background data gave
a generally favorable assessment (median prediction of his response scaled
at 3.1), but judges relying on projective data were unfavorable (median
response 2.0). On the whole, the projective interpreters misjudged the
subject whenever his actual behavior showed good adaptation. In uncertain
situations, they expected the worst of him.

This apparent bias toward psychopathology in Rorschach interpretations
is implied also in Roe’s work on eminent scientists, the Menninger
psychiatrist study, and other investigations. Within any group of normal persons
one encounters records which, considered by themselves, would seem to be
indicative of gross emotional disturbance or psychopathology (Gallagher,
1955). As one Menninger assessor commented, “The TAT usually exposed a
man’s weakest points, without giving compensatory signs of his strong ones.
. . . To some extent the same thing is true of the Rorschach. . . . Latent
conflicts show up in these tests much more plainly than the compensating
strengths; [it is necessary] to be very cautious in assuming that such
potential liabilities are actual if they are not seen operating in much more
direct fashion” (Holt and Luborsky, 1958, p. 246).

One reason for the bias is that projective theory was developed through
the study of mental patients without appropriate control studies of normals.
Only fairly recently have extensive data on normals and superior individuals
been collected. A projective technique reveals drives and impulses, but it
does not indicate clearly how they are controlled. Strong hostility is likely
to be an unfavorable sign in a person tested by a clinic; in an executive,
author, or school superintendent, the same force may be harnessed to creative
and socially constructive activity. Until further research enables the tester to
distinguish unchanneled from controlled forces, one must interpret damaging
indications with caution. Ultimately, we may learn to identify control
mechanisms as well as disruptive forces through projective protocols.

Projective and performance tests are not comprehensive cross sections of
personality. On the contrary, they are observations in a specific situation,
and one generalizes to future situations only with considerable risk. Edith
Lord (1950) showed that a cold, forbidding female Rorschach examiner
elicited a high frequency of unhealthy, uncontrolled emotional response and
strongly compulsive behavior from the same subjects who gave moderate,
passive, unimaginative responses when tested by a soft “mother-figure.”
Moreover, deliberate efforts on the part of the examiner to be more
permissive altered the test performance. No test is a measure of personality in
isolation; it is always a sample of social interaction with a specific other
person (Schafer, 1954; Sarason, 1954). There is considerable risk when one
generalizes to other social interactions.

Blind analysis is the custom in validation studies aimed to determine
what can be done with a single test, but practical interpretation should be
based on considerable background knowledge. Indeed, Holt and Luborsky
wisely recommend study of projective tests only along with intellectual tests
and a case history. “If one could give only the Rorschach and TAT, it would
be better to give no tests at all rather than spend time with so dubious a
prospect of satisfactory results. Projective tests give valuable insights into
personality, but the level of material from which they draw varies so much
from one case to another and its significance is so dependent on a framework
of realistic knowledge about the person that projective techniques can make
their proper contribution only when used in conjunction with other
methods” (p. 303).

The statement about “levels from which they draw” makes reference to
one of the most confusing problems in interpretation of projective tests.
Sometimes a story or free association appears genuinely to reflect deep-lying
repressed conflicts, but one cannot be sure which records or parts of records
have such hidden meanings. The interpreter will do well to heed Schafer’s
warning (1954, p. 150) against “arbitrary, presumptuous efforts to deepen
interpretation in spite of the patient.”

Still another difficulty which can be remedied only by improving
personality theory is semantic confusion. If the test interpreter uses words
which mean different things to different people, he cannot hope that his
interpretations will be confirmed or that they will be practically beneficial.
Many of the key words in dynamic interpretations are highly ambiguous.
Grayson and Tolman (1950) asked psychologists and psychiatrists to define
such words as bizarre and aggression. Twenty-three of these clinicians
defined the aggressive person as hostile and destructive; but 21 of them used
the word to describe positive, assertive, dominant behavior. As the authors
said, “The most striking finding of the study is the looseness and ambiguity
of many of these terms. . . . For the most part, the lack of verbal precision
seems to stem from theoretical confusion in the face of the complexity and
logical inconsistency of psychological phenomena. Verbal discrepancies can
only be reconciled by a deeper understanding of these underlying
phenomena which will require many years of careful, penetrating, and
analytical psychological experience.”

Psychological Study of Treatment Situations 

No psychometric tester would willingly introduce a selection plan without
first validating his tests against criterion information, but the assessor
has generally made blind predictions. The assessor has picked Army officers,
civil servants, espionage agents, and fliers with no better standard
than his hunches about the demands of the situations involved. While one
might excuse such presumptuousness on the grounds of wartime necessity,
when men must be selected according to someone’s best guess, prudence
demands realistic job analysis and test tryout when circumstances permit.
Nearly all the validations of assessment methods have examined the merit
of “naive clinical assessment,” which Holt (1958) describes as follows:

The data used are primarily qualitative with no attempt at objectification,
their processing is entirely a clinical and intuitive matter, and
there is no prior study of the criterion or of the possible relation of the
predictive data to it. Clinical judgment is at every step relied on not
only as a way of integrating data to produce predictions, but also as an
alternative to acquaintance with the facts.

The choice of sound psychometric methods and interpretations has, from
the days of Wissler and Binet, depended upon thorough empirical follow-up;
similar follow-up is even more essential in personality appraisal, where
problems are more complex. Holt suggests the following as a “sophisticated”
clinical method:

Qualitative data from such sources as interviews, life histories, and
projective techniques are used as well as objective test facts and scores,
but as much as possible of objectivity, organization, and scientific
method are introduced into the planning, the gathering of data, and
their analysis. All the refinements of design that the actuarial tradition
has furnished are employed, including job analysis, pilot studies, item
analysis, and successive cross-validations. Quantification and statistics
are used wherever helpful, but the clinician himself is retained as
one of the prime instruments, with an effort to make him as reliable and
valid a data-processor as possible, and he makes the final organization
of the data to yield a set of predictions tailored to each individual case.

This procedure was applied as well as possible in the second phase of the
Menninger psychiatrist study. Naive assessment, as described above, had
been applied with mediocre success, validity being .24. In the “sophisticated
study” the investigators examined enough successful and unsuccessful
men to formulate a concept of the good psychiatrist. Specific cues in the
TAT and other data were identified, to provide an objective framework for
judging the remaining cases. Despite this effort, the resulting scores or
judgments developed in this manner had no validity when based on a single
projective test. Predictive ratings based on all data had validities of .57 for
one judge, .22 for the second (average, .40). Holt and Luborsky conclude
that the validities from the final study are “impressive” and recommend
application of refined assessment methods to the selection of psychiatrists.
The issue appears still to be in doubt, however. The one coefficient of .57 is
high, but it gives no assurance that the validity of judges would be
consistently superior to the naive predictions. Moreover, when Verbal IQ
correlates .39 with the criterion, it is hard to believe that clinical judgments
represent an improvement sufficient to justify the labor involved. Even with
careful analysis of the personalities of men whose criterion scores are known,
assessment must overcome serious difficulties.

Every test represents performance in a highly specific situation, as we
noted earlier; there is a counterpart problem with respect to the criterion.
A criterion rating is generated by the specific interaction of a man and one
set of duties. The psychiatrist who might do well with children can be rated
a failure if he proves unable to cope with hospitalized adults early in his
practical training. An unfortunate first assignment or an incompatible first
supervisor may develop feelings of incompetence and a bad reputation
which prevent the man from reaching his potential. Likewise, the success of
the psychoanalysis the student of psychiatry often undergoes is a significant
but rather unpredictable feature of the situation. Assessment is rarely used
to select men for uniform, well-defined jobs. The usual problem in executive
appraisal or clinical evaluation is to judge how the individual will get
along in an ill-defined or unspecified situation. Where many variable
conditions intervene between prediction and follow-up, high validity cannot be
hoped for.

One might hope for test data to predict the average success of a man in
many independent situations. A statistical composite such as a college grade
average is a “convergent phenomenon” (Langmuir, 1943; L. K. Frank,
1918). Despite lucky or unlucky experience in single courses, the average
becomes more and more stable and hence easier to predict as more courses
are added. All-round popularity is a similar convergent phenomenon. A
person’s standing may vary greatly from church to office to bowling team, but as
more groups are added his average “seeks its level.” A phenomenon is said to
be divergent if the successive events that cause it to develop are highly
interrelated. A landslide is an example. One stone jars another, the two
moving together dislodge others, and soon an irresistible stream of debris is
pouring downhill. This force is a sum of many separate movements, but not
an average of independent events. Rather, every added stone is an
amplification of the original movement; if the first stone had not moved, there
would have been no landslide.
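
The contrast between the two kinds of phenomena can be illustrated by a small
simulation. The sketch below is only an illustration under assumed conditions
and is not drawn from the studies cited: the function names, the ability level
of 2.5, the luck spread of 1.0, and the amplification range of 0.5 to 2.0 are
all hypothetical. A grade average is modeled as the mean of independent course
outcomes; a landslide is modeled as a movement that each added stone multiplies.

```python
import random
import statistics


def grade_average(n_courses, rng, ability=2.5, luck_sd=1.0):
    """Convergent case: the mean of n independent course outcomes.

    ability and luck_sd are hypothetical values chosen for illustration.
    """
    grades = [rng.gauss(ability, luck_sd) for _ in range(n_courses)]
    return statistics.mean(grades)


def landslide_size(n_stones, rng, low=0.5, high=2.0):
    """Divergent case: every added stone amplifies the movement so far.

    Each stone multiplies the moving mass by a random factor (assumed range).
    """
    size = 1.0
    for _ in range(n_stones):
        size *= rng.uniform(low, high)
    return size


def relative_spread(values):
    """Standard deviation divided by the mean; larger means less predictable."""
    return statistics.pstdev(values) / statistics.mean(values)


def simulate(outcome, n_events, trials=2000, seed=0):
    """Repeat one kind of outcome many times and return its relative spread."""
    rng = random.Random(seed)
    return relative_spread([outcome(n_events, rng) for _ in range(trials)])


for n in (4, 40):
    print(n, "courses:", round(simulate(grade_average, n), 3))
    print(n, "stones: ", round(simulate(landslide_size, n), 3))
```

Under these assumptions the relative spread of the grade average shrinks as
courses are added, whereas the relative spread of the landslide grows with
every added stone, so that the single outcome becomes harder rather than
easier to predict.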

Prediction of divergent phenomena is not possible. Possibilities can be
identified (“that hill looks loose enough to slide”), but what will occur can
be predicted only on the average over independent situations (“Landslides
will cost the state road department x thousands of dollars”). The social
scientist can predict accurately how many women in a college class will
marry. He can predict much less well whether a particular woman will
marry. Whether she will marry the man she met last night is extremely
uncertain. Successes and failures do not average out; one quarrel at the
wrong time may end all chance for compensating pleasant experience. Even
with full day-to-day information, the fate of this possible marriage will be
unpredictable for many months.

The assessor can reasonably hope to figure out how many men of a certain
type will succeed in psychiatry. He can perhaps judge what any one man
would do on the average if he could have ten independent careers in
psychiatry. But any one career is like a horse race: a delay in the starting gate,
a jam on the track, one mistake by his rider, and the favorite loses. The
psychiatrist has just one career, in one group, under one set of demands. If
he establishes a good relation with the significant figures in this environment,
the beneficial consequences will rebound through his whole life. Yet that
relation depends on chance events as much as on his stable personal qualities.
As William James warned, psychology can establish general expectations
but cannot hope to give biographies in advance.

4. Show that each of these is the result of a divergent phenomenon.

a. The ceremony was a moving emotional experience.

b. Terry is cooperative, but his brother Mike no one can manage.

c. Charles' interest in science is becoming focused on genetics.

5. How reasonable is it to try to predict each of the following? 

a. Will men of this type respond better to close supervision or to freedom? 

b. How will this man respond to close supervision?

c. Will Mark like selling? 

d. Will Mark like this job as salesman? 

6. Defend the statement: “After a certain point in its development, the divergent
phenomenon becomes predictable.” What sort of information is needed for this
prediction? What does this imply for the psychologist?


The Unique Functions of Assessment Procedures 

In the writer’s opinion, assessment techniques have been asked to do a job
for which they are ill suited. It has been necessary to emphasize the extensive
and discouraging negative results on the use of clinical techniques as
predictors, but there is another, more positive evaluation to be made.

Assessment techniques have three related features which set them apart
from conventional psychometric methods. Stated simply, these are as follows:

They provide information both on typical response patterns and on
stimulus meanings.

They cover a very large number of questions about the individual.

They provide information about different questions for different individuals.

Coverage of Stimulus Meanings. The psychometric approach is to confront
the individual with a carefully selected task or set of tasks which represent a
criterion situation in some way. This description applies to proficiency tests,
to aptitude tests, to questionnaires on typical performance, and to worksample
performance tests such as the LGD. We saw that even impressionistic
interpretation of such samples of behavior gave valid predictions for civil
service and OCS selection. The essential assumption in this type of testing is
that we can generalize from a sample of behavior to performance in one class
of situations.

A person’s behavior changes from situation to situation, however, and
whenever one must understand the person as a whole, or must select situations
to fit him, a simple prediction by sampling within one class is impossible.
One must begin to learn what situations mean for him. Much of the
content of an interview deals with situational meanings: attitudes toward
parents, former employers, school subjects, etc. The thematic projective tests
elicit similar information, though in a more disguised and perhaps less
censored form.

The Semantic Differential is the only psychometric technique designed to
study meanings the person gives to significant others. Even this procedure,
though structured and quantifiable, is interpreted impressionistically when
a single individual is under study. Hence there is no psychometric technique
for obtaining information about the subject’s reactions to various persons and
situations, unless one wishes to prepare dozens of questionnaires or sorts,
each dealing with one person or situation. While research along the lines
recently opened by Osgood and G. A. Kelly may lead to well-controlled
psychometric techniques, at this time there is no alternative to some type of
clinical assessment if we want attitudinal information covering a wide range
of objects. It is unfortunate that there have been no controlled validation
studies to show just how well such procedures as TAT and Semantic
Differential identify significant attitudes. Virtually all systematic validation of
impressionistic methods has examined their adequacy as measures of traits
(i.e., of response information).

7. What value does information about situational attitudes have for a decision 
maker or counselor in each of these situations? 

a. Counseling a couple having marital difficulty. 

b. Appraising junior executives in a corporation. 

c. Evaluating a student on probation because of poor grades. 

8. Why is information about attitudes to diverse situations more important in 
dealing with divergent phenomena than with convergent phenomena? 

9. Evaluate the suggestion that situational attitudes might be assessed by
administering a large number of scorable questionnaires, each dealing with a
different attitude-object. To what extent would this overcome the limitations of
impressionistic assessment?

10. Is the assessment of stimulus meanings really distinguishable from assessment
of traits? (Example: Is “hunger drive” distinct from “attitude toward food”?)

11. List the persons or situations about which attitudinal information would be
useful in clinical study of a bright 9-year-old child who is unable to read.

Bandwidth. Shannon’s “information theory” (1949), developed for the
study of electronic communication systems, provides a model for considering
the second important feature of assessment methods. He distinguishes two
attributes of any communication system: bandwidth and fidelity.

Home record players have made “high fidelity” familiar to everyone. The
complementary concept of bandwidth refers to the amount or complexity of
information one tries to obtain in a given space or time. The fidelity of
recording depends upon the width of the groove; if grooves are crowded
together to put more music on a record, fidelity suffers. Fidelity could be
improved over present standards by designing record and playback systems
which would carry less information (e.g., a 33-rpm record lasting only ten
minutes instead of thirty). With other things held constant, any shift in the
direction of greater fidelity reduces bandwidth, and increase in bandwidth
may be purchased only at the price of fidelity. In any particular communication
system there is an ideal compromise between bandwidth and fidelity.
The record industry settled on the 33-rpm “long-play” record; the FCC
allows the FM station a bandwidth of 22 kilocycles.

The classical psychometric ideal is the instrument with high fidelity and
low bandwidth (Cronbach and Gleser, 1957; Hewer, 1955, pp. 3-19). A
college aptitude test tries to answer just one question with great accuracy. It
concentrates its content in a very narrow range, using correlated items to
increase reliability. Because its parts are highly correlated, part scores give
little information for choosing majors or diagnosing weaknesses. Most other
excellent predictors such as the LGD participation score and the peer rating
have a similar limitation to one central variable.

At the opposite extreme, the interview and the projective technique have
almost unlimited bandwidth. Whereas the aptitude test may devote three
hours to obtaining just one score, the interviewer may cover twenty topics in
a half-hour, and note an even larger number of traits. In some TAT studies
ratings were made on more than forty variables, all on the basis of about an
hour’s testing. The individual description adds a dozen or more statements
about individual traits or attitudes not commonly encountered.

There are tests with intermediate bandwidths, and a particular technique
like the Binet or MMPI may be used as a narrowband method by some testers
and as a wideband method by others. All the validity studies we have
reviewed substantiate Shannon’s principle: increases in complexity of
information are obtained only by sacrificing fidelity. The Wechsler Verbal IQ is
highly valid. Patterns of subtest scores are of some but quite limited value.
And interpretations of responses to single items, or judgments about
observed processes, are distinctly untrustworthy. The most successful
combinations of large bandwidth with relatively high fidelity are the GATB and the
SVIB, both of which are designed for counseling where many alternatives
must be considered and useful prediction can be made from about a dozen
scores.
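
The trade-off can be made concrete with the Spearman-Brown formula for the
reliability of a shortened test. The sketch below is an illustration only, and
rests on assumptions not made in the text: that a fixed testing time is divided
evenly among the scores, that the items are parallel, and that a single score
using the whole time would have a reliability of .95. The function names are
hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Reliability of a test whose length is multiplied by length_factor.

    A length_factor below 1 describes a shortened test.
    """
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


def fidelity_per_score(full_time_reliability, n_scores):
    """Reliability of each score when one testing period is split evenly
    among n_scores separate scores (assumes parallel items)."""
    return spearman_brown(full_time_reliability, 1.0 / n_scores)


# Assumed starting point: one score from the full period has reliability .95.
for n_scores in (1, 2, 5, 12, 40):
    print(n_scores, "scores:", round(fidelity_per_score(0.95, n_scores), 2))
```

With these assumed figures, the reliability of each score falls steadily as the
same testing time is spread over more scores, from .95 for a single score to
about .61 when a dozen scores are taken.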

Extremely large bandwidth is disadvantageous because the information
becomes too unreliable for use. Extremely small bandwidth, on the other
hand, is appropriate only where there is one specific, all-important question
to be answered, to which all testing effort should be devoted. While no rule
can be given specifying the ideal bandwidth for testing, we can point to
conditions favoring wider or narrower bandwidth.

• The first is the number and relative importance of decisions to be made.
If an institution is concerned with a simple decision and only one outcome, it
should concentrate on the information most relevant to that decision.
(Example: a college wishing to admit students who will make good academic
records, without regard to values, social or emotional adjustment, or probable
post-college career.) If many outcomes or alternatives are to be considered,
more types of information are needed and bandwidth must increase.
Counseling, diagnosis, remedial teaching, and supervision of professional
workers generally involve multifaceted decisions. The testing effort should
be balanced to obtain relatively dependable information on the most
important questions or those which are most likely to arise. It is better to ignore
minor questions than to spread one’s inquiry too thin (Cronbach and Gleser,
1957, p. 96).

• Bandwidth can be greatly increased when it is possible to confirm or
reverse judgments at a later time. Lack of fidelity does no harm unless it
leads to costly errors. Narrowband instruments are desired for making final,
irreversible decisions about important matters (e.g., scholarship awards).
The wideband technique, on the other hand, serves well as the first stage in
a sequential measuring operation. As a first stage, the wideband test scans
superficially a range of important variables, pointing out significant
possibilities for further study. In this use the wideband procedure is used for
hypothesis formation, not for final decisions.

This is the proper function of the Strong blank, for example. It is not a
highly valid basis for career choice. It is an inexpensive pencil-and-paper
interview which gives an excellent preliminary mapping of the vocational
field. Its ease of administration, objective scoring, and norms make it
superior to the unconstrained interview (which has even greater bandwidth).
Following the test, the counselor uses a more focused interview to confirm
high scores and to determine their implications. Even this discussion should
not lead to a final decision. It is better to narrow the choice to two or three
areas; these hypotheses can be tested by enrolling in suitable courses and by
trying relevant summer jobs.

Comparable opportunities for follow-up and confirmation of assessments
or score interpretations exist in virtually every decision except selection.
Fallible tests can suggest assignments for an employee, treatments for a patient,
teaching techniques for a student. Even if the test is little better than a
guess, it has some value when there is no sounder basis for choice. Since
trying out the hypothesis permits verification, and change when the hypothesis
was wrong, little has been lost. We may say, in sum, that the fallibility of
wideband procedures does no harm unless the hypotheses and suggestions
they offer are regarded as verified conclusions about the individual. And of
course some degree of skepticism is required in interpreting the score from
any psychological test, however precise and narrowly focused it may be.

Impressionistic procedures, and psychometric procedures in clinical
settings, are chiefly used for hypothesis formation. Clinicians bring a Rorschach
interpretation or a Wechsler IQ to a case conference, where it is considered
along with other data, and this conference concludes that it is better to try
one therapy than another. Only where the decision is irreversible, as when
surgery is prescribed or where the patient once classified is forever left in
the same pigeonhole, is this use of impressions and imperfectly valid scores
dangerous. Likewise in executive appraisal or school psychology, the
recommendations of the tester are recommendations about experiments to be
tried. Unfortunately, assessors (and psychometric testers) have far too often
claimed that their methods give valid final conclusions. This has two bad
consequences: nonpsychologists expect more than the assessor can deliver,
and the psychologist tries to live up to his claim by giving one description or
recommendation instead of outlining the reasonable alternatives.

12. Defend the analogy of a psychological examination to a communication
system.

13. Would you characterize the DAT as wideband or narrowband? the MMPI? 
the test of flicker-fusion frequency? 

14. Did the Hartshorne-May tests of honesty serve better as a wideband or
narrowband procedure?

15. Why is sequential testing of hypotheses more important for wideband than for 
narrowband procedures? 

Adaptation to the Individual. Closely related to the foregoing comments are
the advantages of assessment procedures for shaping the testing to the
individual. The psychometric tester standardizes his test to answer a question
presumed to be important for everyone. The impressionistic tester may vary
the problems and topics covered by the testing to fit the individual. The
psychometric tester tries to standardize every aspect of his measuring
procedure, so that precisely the same information is obtained about each
subject. The impressionistic tester wishes to obtain whatever information is most
significant regarding a particular individual, even if this means asking
different questions of each person. The flexibly administered interview, the
individualized Rep test, and the unstructured projective technique elicit
idiosyncratic, personally significant responses for which there is no counterpart
in psychometric methodology. These responses can only be interpreted
impressionistically.

Meehl (1954) gives several examples of such interpretations, which he
properly regards as the essence of the clinical art. One is from the
psychoanalyst Reik (1948, p. 263):

Our session at this time took the following course. After a few sentences about
the uneventful day, the patient fell into a long silence. She assured me that nothing
was in her thoughts. Silence from me. After many minutes she complained about
a toothache. She told me that she had been to the dentist yesterday. He had given
her an injection and then had pulled a wisdom tooth. The spot was hurting again.
New and longer silence. She pointed to my bookcase in the corner and said,
“There’s a book standing on its head.” Without the slightest hesitation and in a
reproachful voice I said, “But why did you not tell me that you had had an
abortion?”

How Reik made this correct inference from the patient’s chain of associations
and silences is not our concern here. The skill is compounded of theory,
imagination, experience, and willingness to make (and verify or discard)
rash guesses. The important point is that this interpretation, which might
accelerate appreciably the therapy, could not possibly have been reached by a
formal testing procedure. In the first place, such a procedure would be
unlikely to touch upon the particular topic of abortion. Even if it did, there is
no “trait” on which the response could be scored, unless one envisions keying
the MMPI to distinguish ex-abortion patients from other women, and
similarly for every other group having conceivable clinical interest.
Secondly, the response cannot possibly be interpreted by any multiple-regression
or other formal procedure. How could one establish frequency tables to
give the meaning of associations-about-a-dental-extraction-plus-silence-plus-
observation-about-an-inverted-book? This is a unique datum to be
interpreted only by a creative act of applying such a theory as the
psychoanalytic hypothesis that tooth extraction is a disguised symbol for birth. This
extreme example of symbolic communication shows clinical idiographic
interpretation in its purest form, but unique content is interpreted by every
assessor.

The interpreter must likewise deal with the unprecedented when he
predicts response to a specific situation. “Should this child be sent back to his
mother or placed in a foster home?” is a decision in which statistics cannot
aid. No experience table can predict from IQ, anxiety level, or anything else
whether he will adjust well to his mother. This can be estimated only from
her particular character, the child’s character, and the precise home
situation. Any decision about this problem is likely to be wrong, but that is beside
the point. The decision must be made, and insofar as psychological study of
the child can improve the decision, the risk of error is reduced. In this case,
impressionistic appraisal is the best available basis for decision. All the
“little” decisions that take place from minute to minute in therapy and teaching
are similarly resistant to measurement and statistics. In these judgments,
where the psychometric tester would have nothing to say, the hints from
the TAT or a case history may provide valuable guidance (Meehl, 1954,
p. 120).

The difference between psychometric and impressionistic assessment, we
find, is not that one uses multiple-choice questions and one uses inkblots, or
that one is compulsively cautious, the other erratically overambitious. The
two approaches to observation and interpretation are suited to different
purposes. When clinical testers answer questions for which their methods and
theory are badly suited, their answers are next to worthless or at best are
costly beyond their value. When psychometric testers are faced with a
clinical problem calling for understanding rather than simple evaluation (e.g.,
what lies behind a given child’s anxious withdrawal?) they are unable to
give any answer at all. Each in his own proper province will surpass the
other and each outside his province is nearly impotent. Assessment methods
have earned a bad name for themselves by trying to compete with
measurement techniques on their own ground. In the absence of excellent research
to guide the combination of information, the wideband technique should
not be advanced as a means of predicting specific, recurring criteria. The
precisely focused instrument, on the other hand, should not be exalted into
the sole approved technique for gathering information. It is efficient only
when the decision maker asks the particular question for which it has been
designed and validated. Even the TMC must be interpreted
impressionistically when one wants to explain a low score, or to predict performance in
a new training program.

We expect an evolution from naturalistic to highly structured
techniques. Alfred Binet began to explore and define intelligence by means of
impressionistic interpretations of imaginative performance. It was only after
this study had disclosed the important variables that he began to design the
structured tests from which all later ability tests sprang. The personality
questionnaire developed out of psychiatric observations of symptoms, and
the interest questionnaire out of counseling interviews. Pure naturalistic
observation is always the first step in science, followed by gradual structuring
of the observations, and ultimately by definition of specific variables and
quantitative measurement. Whenever the importance of some treatment
and criterion becomes great enough to warrant quantitative measurement,
formal psychometric procedures can be developed to measure them. The
psychometric method can give increasingly more refined and trustworthy
answers to any recurrent question than can an exclusively natural-history
observational approach, because successive stages of research eliminate sources
of error.

This does not mean that impressionistic methods will or should ultimately
disappear. There will always be unique problems to deal with and unique
facts to interpret. Indeed, since every person is unique in certain ways, each
case will present some problems which are beyond the reach of standard
interpretative formulas. Moreover, treatments will continually change, and
judgments about assignment to the new treatments will have to be made
without waiting for years of follow-up research. Even the growth of
psychometric testing creates a demand for greater and more skillful use of
wideband techniques. As more and more specialized scales are developed for
measuring aptitudes, traits, and situational meanings, it will become even
more important to have suitable wideband procedures for use as a first stage
to determine which of these psychometric scales are relevant to each
person. Psychometric and impressionistic testing procedures will always be
needed to supplement each other.

Here again we find illustration of the principle introduced very early in
this book: one cannot identify one group of tests as good and recommend it
for use. For every type of decision and for every type of psychological
information, there are many techniques and many specific instruments. The
instruments differ in practicality, in the degree of training required to use
them, in the variety of information they obtain, and in fidelity. The
instrument that works best for one tester will not be best for another tester
making the same decision. Tests must be chosen by a highly qualified
professional worker who has a thorough understanding of the institution and
persons he serves.

All in all, psychological testing is an accomplishment its developers may
well boast of. Errors of measurement have been reduced year by year, and
the significance of tests has been increased, until today all facets of
American society feel the impact of the testing movement. The school, industry,
marriage, governmental policy, and character-building agencies have all
been aided by tests. Interpretations of test data are daily creating better
lives by guiding a man into a suitable lifework, by placing an adolescent
under therapy which will avert mental disorder, or by detecting causes of a
failure in school which could turn a child into a beaten individual. Methods
are now available which, if used carefully by responsible interpreters, can
unearth the talents in the population and identify personality aberrations
which would cause those talents to be wasted. Building on these techniques,
we are in a position to capitalize as never before on the richness of human
resources.
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Ability, 29 

Ability testing, see General mental ability, 
etc 

Absolute measure, 'll 

Academic aptitude, see College success; 

Educational success 
Acceptability of tests, 142 
Accounting, selection and guidance, 408, 
486 

ACE Personality Report, 507 
ACE Psychological Examination, 229, 816, 
335,347 

Achievement tests, 31,360-400 
summary list 382-384 
Acquiescence, 372,446,451 
Adjustment inventory, 465-406 ff 
Administration of tests, 37-63, 167, 185, 
497-499 

Adulthood, change of ability in, 196 
change of personality in, 488-489 
Age changes in ability, 176-179, 195-197, 
209 

Aiming, 273,275,302-303 
Air Force personnel, prediction of success, 
281,303-314, 328,*340-341,352, 420, 
557,580,587-588 

Allport-Vernon-Lmdzcy Study of Values, 
35 

Allport-Vcmon Study of Values, 140, 489 
Ambiguity, in test interpretation, 597 
in lest items, 444-445,508 
Amencan Board of Exammcis in Profes¬ 
sional Psychology, 10 

American Psychological Association, 11, 22 
-Anecdotal records, 536-538 
Anxiety, as construct, 104,477,494 
effect on performance, 54-56 
Apparatus differences, 309 
Apparatus tests, 301-314 
Apparent movement, 546 
Application of principles, 368-370, 375, 
377, 379 

Aptitude batteries, summary list, 291-292 


Aptitude test, 31,320 

See also Clcucal aptitude, etc 
Architecture, prediction or success, 316 
Aimy Alpha, 162,229 
Army Cenerul Classification Test, 43 
Army personnel, picdittion of success, 73, 
162,217,281,586 

AiLlmi Point Scale of Peifiumanee Tests, 
208 

Artistic abilities, 314-318 
Assessment, 578-608 
somces ol ernn in, 506-510, 590-600 

Bandwidth, 002 
Basal age, 169 
Base rate, 358,478-479 
Bell Adjustment Invenloiy, 480 
Bender Visual Motor Gestalt TesL. 560-562, 
579-581, 584 

Bennett Test of Mechanical Comprehen¬ 
sion, 14,39-41, 77, 87,105,126 
evaluation, 150-151 
validity data, 118,121,316 
variance analysis, 130,138 
Bernreulei Peisonality Inventory, 418, 480, 
489 

Bias in idling, 66,347,508,591 
Bilingual cluldieii, 183 
BillelL-SLair Youth Problems Inventory, 495 
Bluet scales, 160 

Sec also Slanfcnd-BineL 
Biographical Data Blank, 328, 341 
Blacky Pictures, 569,574 
Block Design Test, 41-42,82, 558-560 
Brain damage, 206, 556,560,505 

Calibration, 92 

California Achievement Tests, 382,387 
California First-Year Mental Scale, 212 
California Psychological Inventory (CPI), 
491,495 
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California Test of Mental Maturity 
(CTMM), 110,229 
California Test of Personality, 495 
California Tests in Social and Related Sci¬ 
ences, 382 
Capacity, 164, 181 
Cattell Infant Intelligence Scale, 212 
Character tests, 542-550 
Children's Apperception Test (CAT), 574 
Classification, 18,356 

Clerical aptitude, 143, 270, 273, 332-333 
Clerical workers, selection and guidance, 
116,330-333,422 
Client-centered counseling, 293-297 
Clinical descriptions, validity, 592-594 
Clinical interpretation, 455-456, 579-581 
See also MMPI 

Clinical psychologists, prediction of success, 
425-426, 583-585 

Clinical vs statistical interpretation, 340- 
348, 590-608 
Coaching, 56-59 
Cognitive style, 544 ff.

College admission tests, 37, 57, 173, 383- 
384 

College Qualification Tests (CQT), 229 
College success, prediction of, 72-73, 116, 
226-228, 234, 316, 347

See also Architecture, Mathematics
Columbia Mental Maturity Scale, 207 
Combat effectiveness, 330 
Combining tests, methods for, 339-348 
Commonality, 251 
Comparable forms, 145 
Compensation, 344 

Complex Coordination Test, 30, 254, 304-306, 310-312, 313, 341
Concept-formation tests, 558-560 
Concept Mastery Test, 230
Concurrent validation, 104-119
Configural prediction, 344-346, 493-494
Construct validation, 104-106
Contamination of criteria, 352
Content validation, 104,106, 304-308 
Convergent phenomenon, 599 
Cooperative Achievement Tests, 316, 322 
Cooperative School and College Ability 
Tests (SCAT), 141,230,236 
Coordination, psychomotor, 303-306, 308
See also Complex Coordination Test
Correlation, 110-115

computing guide, 111, 124 
multiple, 339 ff 
Costs of testing, 146-147, 309 
Counseling, client-centered, 293-296 
prescriptive, 297-299 

Crawford Structural Visualization Test, 93-94

Criterion, 103, 108, 329-331 
Critical incident technique, 326 


Critical score, 334-338, 342 ff 
Crossvalidation, 355 

Culture, effect on test score, 182-185, 203- 
204, 217, 237-243 
Culture-Free Intelligence Tests, 230
Curriculum and achievement tests, 362-363, 396-400

Cutting score, 334-338, 342 ff. 

Davis-Eells Games, 230, 240-242
Day record, 531 
Decisions, individual, 284 
institutional, 284, 324-358 
types of, 17-20 

Delinquents, test performance, 188, 205,
481 

Dentistry, prediction of success, 280, 306
Detroit First-Grade Intelligence Test, 223 
Deviation IQ, 171 
Dexterity, 73, 303, 307, 341 
Diagnosis, educational, 390-392 
psychiatric, 484-485 
Diagnostic Reading Tests, 390 
Difference, reliability of, 287
Differential abilities, 269-292
Differential Aptitude Tests (DAT), 40, 71,
88-89, 92, 269-271, 275 ff.
case record, 80, 290, 291 
validity data, 118, 278, 334 
Differential validity, 357 
Difficulty, 134-135 
Directions for tests, 37-49
Distribution, 76 ff., 135
normal, 83-85,135 
smoothed, 82 

Divergent phenomenon, 599 
Division on Child Development, 521, 632 
Dominance, tests of, 468 
Draw-a-Man Test, 207 
Durrell Analysis of Reading Difficulty, 390 
Dynamic interpretation, 455-456, 579-581,
592 ff.

Educational success, prediction of, 173, 
242, 277, 320 
See also College success 
Edwards Personal Preference Schedule,
450-451, 487, 490
Einstellung test, 547 ff.
Embedded Figures test (EFT), 547-558
Emotion, effect on test performance, 54 ff.,
179-180, 205 
Empathy, 319-320 

Empirical keying, 328, 406-408, 427, 456- 
459, 468, 512 
Empirical method, 103 
Engineering, selection and guidance, 321,
334-337, 345, 407,420-423 
Equal-appearing units, 71, 385-387 
Equipercentile method, 92 
Equivalence, coefficient of, 137, 140-142
Equivalent forms, 145 
Error of measurement, 126 ff, 288 
Essay test, scoring of, 65 

See also Recall vs recognition 
Essential High School Content Battery, 
383 

Ethics of testing, 11-13, 459-462 
Evaluation and Adjustment Series, 383 
Evaluation form for tests, 147-152 
Evaluation of treatments, 19 
Examiner, effect of, 60-64,596 
See also Tester
Expectancy table, 72, 387 
Extrinsic validity, 58 

F scale, California, 446 
Facade, 446-451, 453
Face validity, 143 
Factor analysis, 247-268 
Factors, ability, 256-264 
interest, 410, 436 
personality, 467 
psychomotor, 307-308 
Faking, 446-449,458,513 
False positive, 334, 478 
Fatigue, 43 

Feeble-mindedness, 109,173,205 
Fels Parent Behavior Rating Scales, 526- 
528 

Fidelity, 602 

Field observations, 33, 440-441, 528-538 

Flanagan Aptitude Classification Tests, 292 

Flicker fusion, 546, 552, 557

Forced choice, 450-452, 512-516 

Foreign language aptitude, 320 

Forms, comparable, 145 

Four-Picture Test, 574 

Free responses, scoring of, 65 

French Test of Insight, 573 

Frequency distribution, 76 ff, 135 

Frustration, 33, 540-541 

General Aptitude Test Battery (GATB),
82, 272-276, 280 ff., 342
General educational development, tests of,
235, 380-383

General factor, 215, 250, 258
General mental ability, 30, 164, 243-246
as factor, 215 
group tests, 214-243 
summary list, 228-233 
historical background, 157-163
individual tests, 157-208 
summary list, 206-208 
overlap with achievement, 223-225 
predictive validity, 176-181 
preschool tests, 183, 208-212 
spectrum of test types, 235-237 
Generosity error, 506 


Gesell Developmental Schedules, 212 

Gifted persons, 173-174 

Gordon Personal Inventory, 496 

Gordon Personal Profile, 116, 496 

Grade average, prediction of, 72-73,116 

Grade norms, 385-387 

Graves Test of Design Judgment, 318 

Griffiths Mental Development Scale, 212 

Group behavior, observations of, 566-568 

Group factor, 250 

Group test, 35

Guess Who test, 519 

Guessing, 49 

Guidance, differential ability tests, 269-291 
interest tests, 431-436 
See also Counseling 

Guilford-Shneidman-Zimmerman Interest
Survey, 437 

Guilford-Zimmerman Aptitude Survey, 292
Guilford-Zimmerman Temperament Survey, 496, 584

Halo effect, 508 

Haggerty-Olson-Wickman Behavior Rating
Scale, 511 

Hand-Tool Dexterity Test, 306 
Handwriting, 65-66 
Hanfmann-Kasanin test, 559-560
Henmon-Nelson Tests of Mental Ability, 
220, 221, 222, 230 

Heston Personal Adjustment Inventory, 481 
High school testing program, 243, 432, 498
Historical background, 157-163, 394-396,
464-469, 581-582 

Holzinger-Crowder Uni-Factor Tests, 292
Homogeneity of test items, 215, 327-328, 
412-414 

See also Equivalence 

Honesty, tests of, 542-544, 552, 554, 556 
Horn Art Aptitude Inventory, 315 
House-Tree-Person (HTP) Test, 574
Humm-Wadsworth Temperament Scale, 468

Hypotheses, verification of, 20, 121, 494 

Idiographic analysis, 499-504
IER Trimming Test, 300
Illinois Art Ability Test, 316
Impressionistic testing, 24-28, 63, 340-348,
564,579-607 

Incentives and test performance, 52-53
Indians, test performance of, 184
Individual decisions, 284 
Individual test, 35 

Industrial applications of tests, 9, 118, 228,
306, 342, 393, 460 

Industrial arts, prediction of success, 278, 
306,340 

Infant development, tests of, 208-212 
Information theory, 602 





Institute of Personality Assessment and Research, 587

Institutional decisions, 284, 324-358 
Intelligence, 160,164, 244-246 
social, 319 

See also General Mental Ability 
Intelligence quotient (IQ), 102, 170 ff.
distribution, 171-173 
interpretation of, 173-174 
stability, 176-179 
as standard score, 171 
Interaction recorder, 536
Interest inventories, 405-439
interpretation, 428-434
stability of scores, 418-419
summary list, 437
Internal consistency, 141-142 
Interpretation, dynamic, 455-456, 579-581, 
592 ff 

to subject, 431-434, 487 
Intervals, equal, 71, 385-387 
Intrinsic validity, 58 
Introversion, tests of, 466-468
Inventory, see Personality, Interest
Iowa Tests of Basic Skills, 383
Iowa Tests of Educational Development 
(ITED), 383 
Item form, 371

Items, selection of, 364-367, 406-408 
Job analysis, 325-327 

Job performance, prediction of, 116, 217, 
225-228, 279-280, 281, 306, 312-314, 
342, 485 

Job replica, 304-306, 312-314 

Job satisfaction, 420-425 

Judgment, errors of, 346-348, 506-510 

Kohs Block Design Test, 41-42, 82, 558- 
560 

Kuder Preference Record, 412-417 ff., 437, 
448, 450 

Kuder Preference Record—Personal, 496 
Kuder-Richardson formulas, 141
Kuhlmann-Anderson Test, 218-224, 230

Language, foreign, aptitude for, 320 
Layman, appeal to, 142 
Leaderless Group Discussion (LGD), 566-
567 

Leadership, assessment of, 118, 516, 520, 
566-568, 582-589 

Lee-Thorpe Occupational Interest Inventory, 438

Leiter International Performance Scale, 207 
Length of test, 130-132 
Lewerenz Test of Fundamental Abilities in
Visual Art, 317 

Lorge-Thorndike Intelligence Tests, 230


Machine, test-scoring, 67-69 
Make-a-Picture-Story Test (MAPS), 574 
Manifest Anxiety Scale, Taylor, 451, 469, 
477, 495 

Manual, 100,144 
Manual dexterity, see Dexterity 
Mathematics, prediction of success, 278- 
279 

Maximum performance, 29, 370 
Maze test, 29, 55 
Mean, 78-79 

Mechanical comprehension, 251, 252, 281 ff., 341

See also Bennett Test of Mechanical 
Comprehension 
Median, 75 

Medicine, selection and guidance, 338, 352, 
429,486 

Meier Art Judgment Test, 317
Memory factor, 256 

Mental ability, see General Mental ability, 
Special ability 
Mental age, 168 ff.
Mental deficiency, 169, 173, 205
Mental Measurements Yearbook, 101
Merrill-Palmer Scale, 207
Metal Filing Worksample, 306
Metropolitan Achievement Tests, 384 
Miller Analogies Test, 231, 581 
Minnesota Clerical Aptitude Test, 306
Minnesota Counseling Inventory (MCI), 
491,496 

Minnesota Multiphasic Personality Inventory (MMPI), 458, 468, 469-485, 584
case interpretation, 472, 491 
Minnesota Paper Form Board, 306, 340 
Minnesota Preschool Scale, 183, 207 
Minnesota Rate of Manipulation Test, 306, 
361 

Minnesota Spatial Relations Formboard,
273, 306, 340 

Minnesota Vocational Interest Inventory, 
438 

Modern Language Aptitude Test, 321 
Mooney Problem Check Lists, 487, 497 
Motion-picture as testing medium, 393 
Motivation of persons tested, 52-61, 441,
449, 549, 574 

Multiple Aptitude Tests, 292 
Multiple correlation, 339 ff.
Multiple cutoff, 342 ff.

Myers-Briggs Inventory, 469
Matrix test, 215 ff.

Navy personnel, prediction of success, 253, 
344,346,361,371,480 
Need for achievement, 572-574 
Neurotic states, 478-485, 562 
Nomination technique, 519 
See also Peer rating 
Nonverbal tests, 220
Normal probability curve, 83-85 
Normalized score, 83-85 
Norms, 77, 87-94,102, 221, 385-388 
expectancy, 72, 387 
grade, 385-387 
profile, 285 
Number factor, 256 

Numerical Operations test, 254, 206, 341 

Objective test of personality, 443 

See also Performance test of personality

Objectives of instruction, 368-378, 381
Objectivity, 23, 65 
Observation, 32, 440-441 
during tests of ability, 187-188,191 
in field situation, 528-538
as proficiency measure, 393
in standard situation, 442-443, 539-560
Observer error, 393, 506-510, 533-535 
Occupational Interest Inventory, 438 
Odd-even reliability, see Split-half 
Office workers, see Clerical workers 
Ohio State University Psychological Examination, 72-73, 231

Operational Stress technique, 540, 557
Organic brain damage, 206, 556, 560, 565
OSS assessment, 567-568, 581-582 
Otis tests of general ability, 220, 221, 222, 
231,306,361 

Painting, as projective technique, 541 
Parallel forms, 145 

Parent Behavior Rating Scales, 526-528 
Patients, 

rating scale, 524-525 
test performance, 187-188, 200-202, 
472-485, 562 
test procedure, 69 

Pattern interpretation, see Configural scoring, Profile

Peer rating, 442, 518-523, 583
Pencil-paper test, psychomotor, 309
Percentile scale, 74-78, 87
computing guide, 74-75
Perception, tests of, personality, 541-558,
560-565 

Perceptual style, 544-546, 579-581 
Performance tests, 35
of mental ability, 192 ff., 202-206
of personality, 32, 442-443, 539-576 
of proficiency, 354 

Persistence, tests of, 544, 545, 555-556 
Personality,

in ability tests, 189-191, 200-202, 205
in interest tests, 428-431
trait approach to, 466-468, 499-501
as typical behavior, 31-32, 444-445

Personality measures, 31-34,440-607 
general principles, 440-462 
performance tests, 539-576 
predictive validity, 485-487, 542 
projective, 560-566, 569-576, 594 
ratings, 506-528 
self-report, 464-504 
stability, 488-489 

See also Assessment 
Personality Record, 507 
Personality structure, 32, 500 ff.
Phenomenological psychology, 464 
Pictorial items, 371 
Picture Frustration (PF) Test, 575 
Pilot success, see Air Force 
Pintner General Ability Tests, 140, 231
Pintner-Paterson Scale of Performance
Tests, 208 

Pitch discrimination, 133-134, 344, 372 
Porteus Maze Test, 29, 64 
Power test, 222 
Practice, effects of, 310-312 
Prediction, 17, 325-358 

See also Assessment, Engineering, 
selection and guidance, etc 
Predictive validity, 103-119 
Pre-Engineering Inventory, 322
Preschool tests, 183, 208-212
Prescriptive counseling, 297
Primary mental abilities, 256-258
Primary Mental Abilities (PMA), Tests of, 
142, 256-258, 292 

Problem solving, 7, 244-243, 544-560
Process vs product, 20 
Product-moment correlation, 112-115, 
124 

Product rating, 392 
Product vs process, 26
Products, rating of, 392 
Professional aptitude tests, 320-322 
Proficiency test, 31, 360-400
Profile, 86

interpretation, 200-202, 284-291, 473- 
476, 481-482 
reliability, 275, 287 
Program tests, 99 

Progressive Matrices Test, 215-218
Projective techniques, 26, 443, 542, 560- 
565, 569-576, 579-586, 589, 591-597 
summary list, 574-575 
Psychiatric rating scales, 524 
Psychiatrists, prediction of success, 585-586 
Psychologists, selection and guidance, 425- 
426, 431, 583-585 
Psychometric testing, 24-28 

See also Clinical vs statistical interpretation

Psychomotor abilities, 301-314 
Psychomotor factors, 307-308
Psychopathic deviation, 200, 492-493 
Publisher, 98
list of, 609 

Purdue Pegboard, 77, 309 
Pursuit Confusion Test, 305 
Pursuit tests, 303-306 

Q-sort, 514-516, 593 
Questionnaires, 34, 405 ff.

Racial differences in test score, 204 
Range, effect on reliability, 133 
effect on validity, 351 
Rank correlation, 110-111 
Rapport, 38, 60-64, 167,449 
Rating, 506-528 
of products, 392 

Rating scales, 507, 511, 517, 523-528 
Raven Progressive Matrices Test, 215-218
Raw score, 69-71
Reaction time, 301, 307
Reading tests, 388-392 
Reading, relation to ability test performance, 220

Recognition vs recall, 373
Records, anecdotal, 536-538
Refusal of tests, 167 
Regents examinations, 396-397 
Reliability, 126-142 
of a difference, 275, 287
and test length, 130-132
types of, 130-142
Research use of tests, 7, 20, 494
Response set, see Response style
Response style, 50, 372, 446, 450-452
Retest correlation, 131, 136, 139-140
Reviews, test, 101
Rigidity, tests of, 516-555, 559
Rod and Frame test, 518, 553, 555, 558
Role Concept Repertory (Rep) Test, 504
Rorschach method, 66, 502-565
Rotary Pursuit Test, 305,341 

Salesmen, success of, 425, 485-487 
Sample vs. sign, 457 
Sampling of content, 364-360 
Sampling error, 113

Sampling, for observation, 140, 529-530,
540

Scales, see Rating scales, Handwriting
Scatter, 186

Scatter diagram, 112-114,124, 334 
Schemata, 245 

Schizophrenics, test performance, 187, 201, 
203, 491, 558, 560 

Scholastic aptitude, see General mental 
ability, Grade average, prediction
Scholastic Aptitude Test (SAT), 37, 57, 
232 

School, achievement, tests of, 360-400 
prediction, 180-181, 205 


testing programs, 394-400 

See also College success, Educa¬ 
tional success 

School and College Ability Tests (SCAT),
140, 230, 230 

Science, selection and guidance, 431 
Score, corrected for guessing, 49
raw, 69-71
standard, 80-85
Scorer reliability, 65
Scores, combination of, 339-348
Scoring, 65-69 

Screening, on adjustment, 406, 478-484 
Selection, 18,325-356 

See also Assessment, Architecture,
Engineering, etc 
Selection ratio, 350 

Selective Service College Qualification
Test, 89

Self-concept, 294-296, 454-455, 464 
Self-report, 34, 442, 461-501
as self-description, 444-459, 489-490
Semantic Differential, 501-504
Semantic Test of Intelligence, 232
Sentence Completion Technique, 575
Sequential testing, 146, 346, 603 
Sequential Tests of Educational Progress 
(STEP), 384 

Short Employment Tests, 116, 140
Shopwork, prediction of success, 278, 306,
310

proficiency, 392
Shrinkage, 355
Sign vs sample, 457
Simple structure, 255

Situation, effect on response, 413, 485, 529- 
530, 532 

Situation test, 443 
16 P F Test, 497 
Skewed distribution, 84, 135
Skinner box, 69

Social class, and ability scores, 237-243 
and interest scores, 123-425
and motivation, 239-240
Social desirability, see Facade
Social intelligence, 189, 319
Sociogram, 521
Sociometric rating, 519-522
See also Peer rating
Sophistication of persons tested, 58
Spatial ability, 250, 276-281
Spearman-Brown formula, 131, 141
Special abilities, 30 

See also Clerical, Number 
Specific factor, 250, 311-312 
Spectrum of general ability tests, 235-237 
Speed, psychomotor, 303, 306, 307 
Speeding, degree of, 221-223, 306 
Split-half method, 141 
SRA Achievement Series, 384
SRA Youth Inventory, 497 
Stability, coefficient of, 136-137, 139-140 
Standard deviation, 78-80 
Standard error, 126-127
Standard score, 80-87,102 
Standardization, 22, 59 
Standardized proficiency test, 394-400 
Stanford Achievement Test, 384, 386 
Stanford-Binet Scale, 47, 66, 161, 163-189
evaluation, 189
Stanine, 82

Steadiness, 302, 307, 309 
Stenographic aptitude, 332-333 
Stenography, selection and guidance, 332-333

Store, department, personnel testing, 5, 
360-361 
Strategy, 334 ff 

Stress situation, 540-541, 557, 567-568 
Strong Vocational Interest Blank (SVIB), 
406-412, 416-435, 438, 448, 489, 584 
Stroop Color Word test, 547, 557 
Structured and unstructured situations, 540
Studiousness keys, 428
Study habits, 497 
Study of Values, 35,140,489 
Stylistic tests, 569 
Subtle items, 458, 484 
Survey of Study Habits and Attitudes, 497 
Szondi Test, 575

T-score, 80-85 

Taxonomy of educational outcomes, 374- 
380 

Taylor Manifest Anxiety Scale, 451, 469, 
477, 495 

Technical Recommendations, 34, 82, 90, 
101-102 

Terman-McNemar Test of Mental Ability, 
220, 221, 233

Test, choice of, 96-100, 142-153, 325-
358, 498 
definition, 21 

form for examining, 147-152 
free response, 26 

of Mechanical Comprehension, see Mechanical
comprehension, Bennett Test
of Mechanical Comprehension
multiscore, 145, 602
objective, 23 
recognition, 26 
situation, 443 
standardized, 22, 394-400
Tester, interaction with subject, 60-64
qualifications of, 10, 167, 497-499
Testing, costs of, 146-147, 309 
Tests, administration, 37-63, 167,185, 
497-499 

catalogs of, 14-15 


classification of, 29 
distribution of, 9 ff., 98
of Primary Mental Abilities, 142,292 
sources of information, 14 
Thematic Apperception Test, 3, 569-572, 
584, 593 

Thematic tests, 569 ff., 580
Tilting Room, 548 
Time limits, 47,145, 221-223 
Time sampling, 530 
Trade test, 282 

Trait approach to personality, 466-468, 
499-501 

Treatments, evaluation of, 19 
True score, 129 ff 
Two-Hand Coordination Test, 305
Typical performance, tests of, 31, 370,
403 ff.

Unique factor, 250 

Unstructured and structured situations, 540 
Utility, as function of validity, 348-358 


Valentine Intelligence Tests for Children,
208 

Validation, 27, 96-124 
Validity, 96-124 
concurrent, 104-119
construct, 104-106
content, 104, 106, 364-360
curricular, 397
differential, 357 
predictive, 103-119 
relation to reliability, 132 
types of, 103-107 
Validity coefficients, 110, 115-116 
acceptable, 348-358 
Validity generalization, 355 
Variance, 80 

decomposition of, 130, 138, 224 
sources of, 128 

See also Factor analysis 
v:ed factor, 252, 253, 260
Verbal factor, 256 
Vocational Interest Analyses, 438 


War Office selection boards, 143, 581, 582
Water Jar Test, 547, 553, 554

Wechsler Adult Intelligence Scale (WAIS), 41, 54, 82, 192-202, 248

Wechsler-Bellevue Scale (WB), 191-192, 264-267

Wechsler Intelligence Scale for Children (WISC), 192-202, 217

Wittenborn Psychiatric Rating Scales, 524
Woodworth Personal Data Sheet, 465